Improved Azure Active Directory availability

Ganesh Chauhan, Technical Support Specialist, Microsoft Azure

As part of our ongoing effort to be as open and transparent as possible about the major work being done to maintain availability, today we focus on Azure Active Directory. Microsoft Azure Active Directory (Azure AD) is a cloud identity service that provides secure access to more than 250 million monthly active users, connects over 1.4 million unique applications, and processes more than 30 billion authentication requests each day. This makes Azure AD not only the largest enterprise Identity and Access Management solution, but easily one of the largest services in the world.

Our customers rely on Azure AD to manage secure access to all of their applications and services. For us, this means that every authentication request is part of a mission-critical operation. Given the critical nature and scale of the service, reliability and security are our identity team’s top priorities. Azure AD is engineered for availability and security using a truly cloud-native, hyperscale, multi-tenant architecture, and our team runs a continuous program to raise the bar on both.

Azure Active Directory: Fundamentals of availability

A service of this size, complexity, and mission-criticality must be engineered to be highly available in a world where everything we rely on can and does fail.

The reliability principles described in the sections below guide how we allocate our investments in resilience.

Our availability work takes a layered defense approach: reduce the likelihood of a customer-visible failure as much as possible; if a failure does occur, limit its impact; and finally, shorten the time it takes to recover from and mitigate it.

In the coming weeks and months, we’ll look more closely at how each of these principles is designed and verified in practice, with examples of how they benefit our customers.

Highly redundant

Azure AD is a global service with multiple levels of internal redundancy and automatic recovery. It is deployed in more than 30 datacenters worldwide, a number that is growing quickly as more Azure regions are set up, and it makes use of Azure Availability Zones where they are available.

For durability, every piece of data written to Azure AD is replicated to at least 4 and up to 13 datacenters, depending on your tenant configuration. Within each datacenter, data is replicated at least nine times, both for durability and to scale out capacity to serve the authentication load. To illustrate, this means that at any one time our smallest region has at least 36 copies of your directory data available inside it. For durability, a write to Azure AD is not considered complete until it has been successfully committed to a datacenter outside the region.
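The replica count above is simple multiplication, and a few lines of Python make the arithmetic explicit. This is only a worked example of the figures quoted in this post; the constant and function names are illustrative, and the larger number is an extrapolation from the same figures rather than a number stated here.

```python
# Figures quoted in the post; the names are illustrative, not Azure AD settings.
MIN_DATACENTERS = 4              # every write reaches at least 4 datacenters
MAX_DATACENTERS = 13             # and up to 13, depending on tenant configuration
MIN_REPLICAS_PER_DATACENTER = 9  # at least nine replicas inside each datacenter


def min_copies(datacenters: int, replicas_per_datacenter: int) -> int:
    """Lower bound on the number of copies of a piece of directory data."""
    return datacenters * replicas_per_datacenter


smallest_footprint = min_copies(MIN_DATACENTERS, MIN_REPLICAS_PER_DATACENTER)
# Extrapolation (assumption): the same lower bound applied to 13 datacenters.
largest_footprint = min_copies(MAX_DATACENTERS, MIN_REPLICAS_PER_DATACENTER)

print(smallest_footprint)  # 36, matching the figure quoted in the text
print(largest_footprint)   # 117
```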

This approach gives us both durability of the data and massive redundancy: multiple network paths and datacenters can serve any given authorization request, and the system automatically and intelligently retries and routes around failures, both inside a datacenter and across datacenters.
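The idea of retrying and routing around failures can be sketched in a few lines. This is a minimal illustration, not how Azure AD implements it: the datacenter names, the `send` callback, and the retry count are all hypothetical, and a real system would add timeouts, backoff, and health-aware routing.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no candidate datacenter could serve the request."""


def serve_with_failover(
    request: str,
    datacenters: Sequence[str],
    send: Callable[[str, str], str],
    attempts_per_datacenter: int = 2,
) -> str:
    """Retry inside a datacenter first, then route to the next one."""
    errors: List[str] = []
    for dc in datacenters:
        for attempt in range(attempts_per_datacenter):
            try:
                return send(dc, request)
            except ConnectionError as exc:
                errors.append(f"{dc} attempt {attempt + 1}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


def fake_send(dc: str, request: str) -> str:
    """Hypothetical transport: one datacenter is down, the others answer."""
    if dc == "dc-west":
        raise ConnectionError("unreachable")
    return f"{request} served by {dc}"


result = serve_with_failover("token-request", ["dc-west", "dc-east"], fake_send)
print(result)  # token-request served by dc-east
```

The caller never sees the failure of the first datacenter; it only sees an error if every candidate is exhausted.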

We regularly perform fault injection to test the system’s resilience to the failure of the components Azure AD is built on. This extends all the way to regularly taking entire datacenters out of service, to confirm that the system can withstand the loss of a datacenter with no impact on customers.

No single points of failure (SPOF)

As mentioned above, Azure AD is built with several levels of internal resilience, but our approach goes further and extends to all of our external dependencies. This is captured in our no single point of failure (SPOF) principle.

Given the criticality of our services, we don’t accept SPOFs in critical external systems such as the Domain Name System (DNS), content delivery networks (CDN), or the telco providers that carry our multi-factor authentication (MFA) traffic, including SMS and voice. For each of these systems, we use multiple redundant providers in a fully active-active configuration.

Much of the work on this principle was completed over the past calendar year. For example, during a recent outage at a large DNS provider, Azure AD was entirely unaffected because we had an active-active path to an alternate provider.

Elastically scales

With over 300,000 CPU cores, Azure AD is already a massive system, and it can rely on the scalability of the Azure cloud to scale up dynamically and rapidly to meet any demand. This includes natural increases in traffic, such as a 9 AM peak in authentications in a given region, as well as huge surges in new traffic served by Azure AD B2C, which powers some of the world’s largest events and frequently sees rushes of millions of new users.

As an additional layer of resilience, Azure AD overprovisions its capacity, and it is designed so that the failover of an entire datacenter requires no additional capacity provisioning to absorb the redistributed load. This gives us the confidence that, in an emergency, we already have all the capacity we need on hand.
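To make the overprovisioning point concrete, here is a minimal sketch of the headroom calculation it implies. It assumes evenly sized datacenters and evenly spread load, which this post does not state; the function names and the numbers in the usage lines are hypothetical.

```python
from typing import Sequence


def max_safe_utilization(datacenters: int, tolerated_failures: int = 1) -> float:
    """Highest steady-state utilization that still absorbs the given failures."""
    if not 0 <= tolerated_failures < datacenters:
        raise ValueError("tolerated_failures must be in [0, datacenters)")
    return (datacenters - tolerated_failures) / datacenters


def survives_failover(
    loads: Sequence[float], capacities: Sequence[float], failed: int
) -> bool:
    """True if the surviving datacenters can carry the total load unaided."""
    total_load = sum(loads)
    surviving = sum(c for i, c in enumerate(capacities) if i != failed)
    return total_load <= surviving


print(max_safe_utilization(4))  # 0.75

# Four datacenters of capacity 100 each; datacenter 0 fails.
print(survives_failover([70, 70, 70, 70], [100] * 4, failed=0))  # True
print(survives_failover([80, 80, 80, 80], [100] * 4, failed=0))  # False
```

In this simplified model, four equal datacenters must each stay at or below 75% utilization for the loss of one to need no new capacity.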

Safe deployment

Safe deployment ensures that changes (code or configuration) progress gradually from internal automation, to Microsoft-internal self-hosting rings, to production. Within production, we ramp up the percentage of users exposed to a change in very gradual steps, with automated health checks gating progression from one deployment ring to the next. A change takes more than a week to roll out fully across production, and at any point it can be rolled back rapidly to the last known healthy state.

This system routinely catches potential failures in our “early rings,” which are wholly internal to Microsoft, and stops them from reaching rings that would affect customer and production traffic.
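A ring-based rollout with health gates can be sketched as a simple loop. This is an illustration of the general technique under stated assumptions, not Microsoft's deployment tooling: the ring names and the `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def staged_rollout(
    rings: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Optional[str]:
    """Deploy ring by ring; return the ring that failed its gate, or None."""
    deployed: List[str] = []
    for ring in rings:
        deploy(ring)
        deployed.append(ring)
        if not is_healthy(ring):
            # Health gate failed: undo every ring touched so far, newest first.
            for done in reversed(deployed):
                rollback(done)
            return ring
    return None


# Hypothetical ring names, ordered from least to most customer exposure.
RINGS = ["automation", "microsoft-internal", "production-1pct", "production-all"]
log: List[str] = []

failed_ring = staged_rollout(
    RINGS,
    deploy=lambda ring: log.append(f"deploy {ring}"),
    is_healthy=lambda ring: ring != "microsoft-internal",  # simulated regression
    rollback=lambda ring: log.append(f"rollback {ring}"),
)
print(failed_ring)  # microsoft-internal
print(log)
```

In this run the simulated regression is caught in an internal ring, so neither production ring is ever deployed to.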

Modern verification

To support the health checks that gate safe deployment, and to give our engineering team insight into the state of the systems, Azure AD emits a massive amount of internal telemetry, metrics, and signals used to monitor system health. At our scale, this amounts to more than 11 petabytes of signals each week feeding our automated health monitoring systems. Those systems in turn trigger alerts to automation and to our team of engineers, who are available 24 hours a day, 365 days a year to respond to any potential degradation in availability or quality of service (QoS).

Our goal is to broaden that telemetry so that it reflects not only the health of the services themselves but also the end-to-end health of a given scenario for a given tenant. Our team is already alerting on these metrics internally, and we’re evaluating how to expose this per-tenant health data directly to customers in the Azure Portal.

Fine-grained fault domains and partitioning

A good analogy for Azure AD is the set of compartments in a submarine, which are designed so that one can flood without affecting the other compartments or the integrity of the vessel as a whole.

The counterpart in Azure AD is a fault domain: the scale units that serve a set of tenants in one fault domain are designed to be completely isolated from the scale units in other fault domains. These fault domains provide hard isolation for many classes of failure, so that the “blast radius” of a fault is contained within a single fault domain.

Until now, Azure AD has consisted of five separate fault domains. This number will rise to 50 fault domains by the end of next summer, and many services, including Azure Multi-Factor Authentication (MFA), are being moved so that they are fully isolated within those same fault domains.

This hard partitioning work is designed as a final catch-all that limits any outage or failure to no more than 1/50, or 2%, of our users. Over the following year, we aim to increase this further to hundreds of fault domains.
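The relationship between the number of fault domains and the blast radius can be illustrated with a short sketch. This post does not say how tenants are assigned to fault domains, so the hash-based mapping below is purely an assumption for illustration, as are the function names and tenant identifier.

```python
import hashlib


def fault_domain_for(tenant_id: str, domain_count: int) -> int:
    """Map a tenant to a fault domain with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % domain_count


def blast_radius(domain_count: int) -> float:
    """Fraction of tenants affected if one fault domain fails entirely,
    assuming tenants are spread evenly across domains."""
    return 1 / domain_count


print(blast_radius(5))   # 0.2
print(blast_radius(50))  # 0.02, the 2% figure quoted above

# The same tenant always lands in the same domain for a given domain count.
domain = fault_domain_for("contoso.example", 50)
print(0 <= domain < 50)  # True
```

Note that plain modulo hashing reassigns most tenants whenever the domain count changes, so a production system growing from 5 to 50 domains would need a more deliberate placement scheme than this sketch.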

A preview of what’s to come

The principles above aim to harden the core Azure AD service. Given the critical nature of Azure AD, we’re not stopping there. Future posts will cover additional investments we’re making, including rolling out in production a second, fully fault-decorrelated identity service that can provide seamless fallback authentication support if the primary Azure AD service fails.

Think of this as the equivalent of a backup generator or uninterruptible power supply (UPS) system that provides coverage and protection if the primary power grid is affected. This system is completely seamless and transparent to end users. It is already in production today, protecting a portion of our critical authentication flows for a set of M365 workloads, and its coverage will be progressively expanded to more scenarios and workloads.