How Cloudflare's routine config change brought 20% of the internet to a standstill
On November 18, 2025, Cloudflare suffered its worst service outage since 2019 — nearly six hours that took thousands of websites offline and disrupted services used by hundreds of millions of people worldwide. The cause was a single routine database maintenance change, and the risk it revealed is one that every software-dependent business shares: the concentration of critical internet infrastructure onto a handful of providers.
This article covers what happened, why this failure mode is systemic and recurring, and what businesses can do to reduce their exposure before it becomes their problem.
TL;DR
- What happened: On November 18, 2025, a routine database access control change caused a core Cloudflare system to generate an oversized configuration file, triggering a crash that returned errors to every site the company serves globally, for a total impact window of five hours and 46 minutes.
- Why it matters: Cloudflare handles approximately 20% of all global web traffic, so a single internal configuration error simultaneously disrupted platforms serving hundreds of millions of users across finance, commerce, transport, and social media; forex and CFD brokers alone lost an estimated $1.58 billion in trading volume within three hours.
- What to do: Map every critical service routed through a single provider, pre-configure failover, and back your recovery path with software escrow, so that if a dependency fails or disappears entirely, you have verified access to everything needed to rebuild independently. Regulatory frameworks like DORA now require exactly this documentation; the businesses that had it on November 18 had options, while the ones that didn't could only wait.
A database permissions change took Cloudflare's global network offline for nearly six hours
At 11:05 UTC, Cloudflare's engineering team deployed what appeared to be a routine database access control change, the kind of maintenance task that passes without incident on a normal day. Instead, the change caused Cloudflare's Bot Management system (the feature that classifies incoming web traffic as human or automated) to generate a configuration file more than double its normal size. That file exceeded a hardcoded memory limit in the core proxy layer, triggering a crash that pushed error responses to every website Cloudflare serves.
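The mechanics of the crash generalize beyond Cloudflare: a consumer with a hardcoded capacity limit treated an oversized input as fatal. A minimal Python sketch of the safer pattern (all names are hypothetical, and Cloudflare's proxy is not written in Python) rejects the oversized file and keeps the last known-good configuration live instead of crashing:

```python
MAX_FEATURES = 200  # hardcoded capacity limit, standing in for the proxy's fixed memory budget

def apply_feature_config(new_features, last_known_good):
    """Validate a freshly generated feature file before swapping it in.

    Treating an oversized file as fatal (the failure mode in the outage)
    turns one bad config push into a global outage; rejecting it and
    keeping the previous file live degrades bot detection slightly
    instead of dropping traffic.
    """
    if len(new_features) > MAX_FEATURES:
        # Reject the update; the previous configuration stays active.
        return last_known_good, "rejected: config exceeds feature limit"
    return new_features, "applied"

normal = [f"feature_{i}" for i in range(100)]
oversized = [f"feature_{i}" for i in range(250)]  # more than double the normal size

active, status = apply_feature_config(oversized, normal)
# the oversized file is rejected and the last known-good config remains active
```

The design choice is the point: a validation gate at the consumer turns a globally propagated bad file into a local, recoverable event.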
By 11:20 UTC, failures were spreading globally, and Cloudflare's engineers — trying to diagnose the sudden scale of the disruption — initially suspected a hyper-scale cyberattack, because the failure was too large and too fast to appear self-inflicted.
As the outage spread, every Cloudflare product dependent on its core proxy went down alongside it: the CDN, security services, bot verification, enterprise authentication, and the Cloudflare dashboard itself. That last failure compounded everything that followed: customers attempting emergency configuration changes found themselves locked out, and Cloudflare's own engineers couldn't access the internal tools they needed, because those tools depended on the same infrastructure that had failed.
The disruption reached well beyond Cloudflare's own systems: ChatGPT, X, Spotify, Discord, Coinbase, NJ Transit, France's national railway SNCF, and McDonald's ordering kiosks were among the platforms that confirmed disruptions. The timing compounded the exposure: the outage arrived days before Black Friday, and 63.3% of US e-commerce stores that use Cloudflare's CDN run on Shopify, placing much of the retail year's most critical window behind a single provider.
Core traffic was largely restored by 14:30 UTC, three hours and 10 minutes after the first failures appeared, with full resolution confirmed at 17:06 UTC. In its post-mortem, Cloudflare CEO Matthew Prince described it as the company's worst outage since 2019, noting that in over six years, no event had caused the majority of core traffic to stop flowing through the network — a distinction that a single misconfigured internal database query had now claimed.
This is what third-party concentration risk looks like in practice
At the time of the outage, Cloudflare served 21.9% of all websites globally. And among websites that use any reverse proxy service (the intermediary layer that sits between users and websites, handling security, speed, and traffic routing), its share reached 82.5%. When a single provider reaches that scale of dominance, it stops functioning like a vendor and starts functioning like utility infrastructure, and when utility infrastructure fails, it fails for everyone on the grid at once.
The November 18 outage reached the foundational layer that thousands of unrelated businesses, governments, and services had built their operations on top of, in many cases without fully understanding how deep that dependency ran.
The financial consequences reflected that structural reality. Finance Magnates Intelligence estimated that forex and CFD brokers — firms trading Contracts for Difference, financial instruments where execution windows are measured in seconds — lost approximately $1.58 billion in trading volume during the three peak hours of the outage. Research cited by Forbes places the cost of a single hour of downtime for mid-size and large enterprises above $300,000, which means businesses that experienced the full core outage window faced exposure exceeding $900,000, and that's before factoring in the costs of SLA penalties, litigation, or reputational damage.
» Assess your own concentration risk exposure with Codekeeper's free risk assessment.
The November 18 outage generated over 3.3 million user-reported service disruptions on Downdetector, ranking it the third-largest outage by report volume recorded anywhere in the world across all of 2025.
Those 3.3 million reports represent the visible surface of a much deeper problem. Research from the past three years shows that 40% of organizations have experienced a major outage caused by human error, with staff failing to follow procedures contributing to 58% of those cases — and the Cloudflare outage fits that profile exactly: a procedurally routine change, deployed without adequate safeguards, propagated globally within minutes.
That kind of error isn't a corner case either. Seventeen days later, Cloudflare suffered a second outage on December 5 — a firewall configuration change that disrupted 28% of HTTP traffic and took Canva, Coinbase, LinkedIn, and Zoom offline. In its December post-mortem, Cloudflare acknowledged that both incidents shared a common root characteristic: the ability to push configuration changes globally within seconds, with insufficient gates to catch a bad one before it reached worldwide coverage.
Moving critical infrastructure onto a handful of providers has compressed risk into fewer, larger failure points. When one of those providers fails, the failure reaches every business, government, and service that built on top of it simultaneously — and with each year of consolidation, that reach extends further.
How to protect your software against concentration risk
Organizations with pre-configured DNS failover (a backup routing path that automatically redirects traffic to an alternative server when a primary provider goes down) were able to at least partially continue operations while waiting for Cloudflare to restore service. Organizations without one had no option but to wait through the full three-hour core outage window, entirely dependent on a provider's recovery timeline. Both groups faced the same outage. The preparation each had in place beforehand determined how they came out of it. Here's how you can prepare for similar scenarios:
- Failure: Every business that went offline on November 18 had one thing in common: their operations depended on a provider whose internal failure cascaded directly to them, with no independent path to continue serving users. The practical response is to map your full dependency surface (understanding exactly which systems route through which providers) and pre-configure failover before you need it. Beyond that, software escrow ensures that if a provider goes down permanently or a software relationship ends, you have verified access to the source code, configuration files, and documentation needed to rebuild or migrate independently, rather than starting from zero.
- Attacks: The November 18 outage wasn't a cyberattack, but it exposed the same structural vulnerability one would exploit: when CDN, authentication, and security services all run through the same provider, disrupting that provider takes all three down simultaneously. Cloudflare Access failing alongside the CDN locked users out of platforms and prevented engineers from reaching their own tools at the same time. Decoupling authentication from your CDN layer removes that single point of failure, and maintaining escrowed backups, stored in an immutable vault outside your production environment, ensures that a compromised or disrupted provider doesn't take your recovery path down with it.
- Non-compliance: Regulators have been watching this concentration problem build for years, and November 2025 was the month the EU acted on it directly. DORA entered enforcement in January 2025 and published its first list of Critical ICT Third-Party Providers in November 2025, the same month as this outage. Under DORA, EU financial entities are required to register all ICT third-party providers, document concentration risk, and maintain exit strategies for critical dependencies. NIS2 extends comparable obligations across essential and important entities in other sectors. Escrow agreements and verified documentation from Codekeeper satisfy DORA, NIS2, and ISO 27001 audit requirements, giving compliance teams the documented proof of resilience those frameworks require.
- Broken: The Cloudflare outage was caused by a configuration change that was never validated before it was pushed globally, and Cloudflare's own response was to build verification gates into every configuration deployment going forward. The same principle applies to your recovery assets: an escrow deposit that has never been tested is an assumption, not a guarantee. Codekeeper's verification testing confirms that escrowed source code, configurations, and dependencies are complete and buildable before a crisis forces you to find out otherwise.
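The pre-configured failover described above typically hinges on health checks with hysteresis, so a single flapping probe doesn't bounce traffic onto the backup. A simplified Python sketch of the decision logic (names are hypothetical; the actual DNS update would go through your provider's API, which is omitted here):

```python
class FailoverMonitor:
    """Decide where DNS should point, based on repeated health checks.

    Fail over to the secondary origin only after `threshold` consecutive
    failed probes of the primary, and fail back as soon as the primary
    recovers. The threshold prevents one transient timeout from
    triggering a DNS flip.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.active = "primary"

    def record_probe(self, primary_healthy):
        if primary_healthy:
            self.consecutive_failures = 0
            self.active = "primary"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active = "secondary"
        return self.active

monitor = FailoverMonitor(threshold=3)
for probe in (False, False, False):  # primary provider starts failing
    route = monitor.record_probe(probe)
# after three consecutive failed probes, traffic routes to the secondary
```

Managed DNS providers offer this as a built-in feature; the value on November 18 came from having the secondary origin and the failover policy configured before the outage, not from the sophistication of the logic.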
If this happened to your business
At 11:20 UTC, your customer-facing platform starts returning errors. By 11:30, your support queue is filling with complaints, and your team is trying to log into the Cloudflare dashboard to investigate — except the dashboard is also down. You spend the next two hours on hold with providers who are fielding thousands of simultaneous calls, watching a status page that's intermittently unreachable because it runs on the same network that's failing.
By early afternoon, Cloudflare's core services begin recovering, and some of your dependencies start coming back online — piecemeal, at different times, in no predictable order. Without documented architecture and pre-configured recovery procedures, your team is manually checking systems one by one, unsure what's back and what isn't. Full operational status isn't confirmed until early evening — nearly eight hours after the first failure.
By then, the damage is already done. Revenue stopped flowing for the duration of the outage. Customers who hit error pages during those hours didn't wait; they went elsewhere. If you operate under DORA or NIS2, the absence of a documented continuity plan and a tested recovery procedure means the outage lands as a compliance failure on top of a technical one. Being back online doesn't undo any of that.
» Build your disaster recovery plan before the next outage forces your hand
Cloudflare failed twice in 17 days, and the internet's growing consolidation makes the next failure a matter of when
The November 18 outage will not be the last event of its kind; Cloudflare's own engineers acknowledged as much when a second configuration failure struck seventeen days later. The internet is moving toward greater consolidation, not less, which means the impact of any single provider failure on your operations will only grow over time. The practical question for any software-dependent business has shifted to whether you've built enough independence from your critical dependencies to have options when this failure mode resurfaces.
» If you want to minimize your concentration risk exposure and build a recovery system that protects you when a dependency fails, software resilience is the right decision. Codekeeper's software escrow and verification solutions give you the autonomy to keep operating through your vendor's downtime.