Cloudflare outage: Why automation turned against the web

The Cloudflare outage triggered by an internal software fault reveals how monoculture, automation, and gaps in contingency planning undermine digital resilience.

David Sehyeon Baek

The Cloudflare outage of 18 November 2025 revealed just how dependent the digital world has become on a few powerful infrastructure providers. What began as a routine internal configuration change quickly escalated into one of the most disruptive incidents in recent years. For hours, websites and applications worldwide were unreachable, displaying only blank screens or 500-series server errors.

Platforms like X (Twitter), OpenAI’s ChatGPT, Canva, and Uber all went down, along with countless smaller services that depend on Cloudflare’s network. Analysts later estimated that nearly a quarter of all websites worldwide were affected, an event that exposed the hidden fragility of the internet’s underlying structure.

Cloudflare’s role in the modern web cannot be overstated. It serves as the invisible backbone for millions of websites, providing content delivery, DNS resolution, DDoS protection, and web application firewalls. Nearly all traffic to Cloudflare-protected sites passes through its edge servers before reaching the origin servers that host the applications themselves. This model improves speed and security but also creates a single point of dependency. When Cloudflare’s internal systems failed, that dependency turned from strength into weakness, and much of the internet’s traffic collapsed with it in real time.

Even the services designed to report outages became casualties themselves. Downdetector, which tracks service failures, was knocked offline because it relied on Cloudflare’s infrastructure. Cloudflare’s own status page was also unavailable, deepening the confusion and speculation.

Some assumed the company was under attack, while others thought it was a global DNS failure. In reality, the cause was entirely internal—a small but disastrous software error that multiplied through Cloudflare’s automated systems faster than anyone could stop it.

What Triggered the Cloudflare Outage?

At the heart of the failure was Cloudflare’s Bot Management system, which uses machine learning to distinguish between legitimate users and automated bots. This system relies on a configuration file, the “feature file,” which contains data points used to score traffic behaviour. On that morning, engineers made a change to database permissions within a ClickHouse cluster, one of Cloudflare’s internal analytics databases.

The modification was intended to tighten query security, but accidentally altered how data was retrieved. As a result, the query that generated the feature file began pulling duplicate data from database shards, producing a file more than twice its standard size.
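
To make the mechanism concrete, here is a minimal sketch (in Rust, purely illustrative and not Cloudflare’s actual code) of how a query that suddenly returns each row once per underlying shard inflates a feature list unless the results are deduplicated. The names and counts are assumptions based only on the figures described above.

```rust
use std::collections::BTreeSet;

// Hypothetical row returned by the analytics query: one entry per bot-management feature.
#[derive(Clone)]
struct FeatureRow {
    name: String,
}

// Before the permission change the query returned one row per feature; afterwards the
// same rows appeared again because extra underlying shard tables became visible to it.
fn query_feature_rows(duplicated_by_shards: bool) -> Vec<FeatureRow> {
    let base: Vec<FeatureRow> = (0..60)
        .map(|i| FeatureRow { name: format!("feature_{i}") })
        .collect();
    if duplicated_by_shards {
        // Duplicate rows slip in because the query now sees additional schemas.
        base.iter().cloned().chain(base.iter().cloned()).collect()
    } else {
        base
    }
}

fn main() {
    let rows = query_feature_rows(true);
    println!("rows returned: {}", rows.len()); // 120 instead of the usual 60

    // Deduplicating by feature name would have kept the generated file at its normal size.
    let unique: BTreeSet<String> = rows.iter().map(|r| r.name.clone()).collect();
    println!("unique features: {}", unique.len()); // back to 60
}
```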

Under normal conditions, the feature file contained about sixty machine-learning features, comfortably within the system’s maximum limit of two hundred. After the permission change, duplicate entries swelled the file to more than double its normal size, pushing the feature count past that limit. The newer proxy engine Cloudflare had deployed, known as FL2, enforced the cap as a built-in safeguard against overload.

When the oversized file was loaded, the proxy hit an internal assertion in its Rust-based code and crashed. The impact was immediate: HTTP 5xx errors cascaded across the network. The outage spread automatically, because Cloudflare’s update system—the mechanism that allows global updates within seconds—propagated the faulty file to every edge data centre.
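
The crash itself follows a familiar pattern: a loader enforces a hard cap, and the calling code treats a violation of that cap as impossible. The sketch below is an illustrative approximation, not Cloudflare’s real FL2 code; only the limit of two hundred features comes from the account above.

```rust
// Maximum number of bot-management features the proxy will accept (figure from the article).
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
struct ConfigTooLarge {
    got: usize,
}

// Illustrative loader: rejects a feature file that exceeds the hard cap.
fn load_feature_file(features: &[String]) -> Result<Vec<String>, ConfigTooLarge> {
    if features.len() > MAX_FEATURES {
        return Err(ConfigTooLarge { got: features.len() });
    }
    Ok(features.to_vec())
}

fn main() {
    // Simulate the oversized file produced after the permission change.
    let oversized: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();

    // Treating "too large" as an unreachable condition turns a bad configuration
    // file into a process-wide panic, roughly the failure mode described above.
    let _features = load_feature_file(&oversized)
        .expect("feature file within limits"); // panics here
}
```

A panic of this kind can bring down the whole serving process rather than failing a single request, which is consistent with the cascade of 5xx errors described above.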

The pattern of failure was confusing at first. Services would briefly recover, only to go offline again within minutes. This was because the file was regenerated every five minutes, and depending on which database shard the query used, it generated either a good or a bad version. Each time the bad file propagated, a crash occurred, sending large parts of the network into a recurring cycle of collapse and recovery. When the engineers realised what was happening, they halted the generation of new files, replaced the corrupted one with a verified version, and redeployed it. While the recovery began almost immediately, it took several hours to restore full stability.
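
The resulting flapping can be pictured with a small, purely illustrative simulation: every regeneration cycle rebuilds the file, and whether the proxy survives the cycle depends on which part of the cluster happened to serve the query. Nothing here reflects Cloudflare’s internal code; it only mirrors the five-minute rhythm described above.

```rust
// Illustrative only: models the alternating good/bad feature files described above.
fn regenerate_feature_count(cycle: u32) -> usize {
    // Pretend half the cluster already has the new permissions applied. A query that
    // lands on an updated part returns duplicated rows; the rest still return 60 features.
    let hit_updated_shard = cycle % 2 == 0;
    if hit_updated_shard { 260 } else { 60 }
}

fn main() {
    const LIMIT: usize = 200;
    for cycle in 0..6 {
        // Each cycle corresponds to one five-minute regeneration window.
        let count = regenerate_feature_count(cycle);
        let status = if count > LIMIT { "proxy crashes" } else { "traffic recovers" };
        println!("cycle {cycle}: {count} features -> {status}");
    }
}
```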

How the Outage Disrupted Business

The outage lasted most of the business day in some regions. Cloudflare observed the first disruptions around 11:20 UTC (16:50 IST) and confirmed near-complete recovery by late afternoon. Independent monitors reported the most severe downtime between 13:30 and 17:00 UTC (19:00 – 22:30 IST). As the company confirmed, this was one of its most serious outages in years.

Initially, Cloudflare’s team believed they might be facing a massive distributed denial-of-service attack. The fluctuating error rates resembled the load patterns of such an assault, and the simultaneous failure of the external status page reinforced that suspicion. However, internal log analysis soon revealed the actual cause—a failed configuration update in the bot management system. The company later described the event as “deeply painful” and issued a public apology.

What made this incident so disruptive was not only the bug itself but the interconnectedness of the internet’s architecture. Cloudflare’s network sits directly in the data path for millions of other systems. When its edge servers failed, everything layered on top of them failed too—from authentication systems and payment gateways, to AI interfaces and even government websites.

Some businesses tried to switch to backup networks or disable Cloudflare’s proxy function, but those options were limited. Many companies discovered that it was not easy to reconfigure their DNS settings or reroute traffic because those, too, were managed by Cloudflare. The outage did not just break websites; it exposed how deeply the world has built its digital foundations on a few shared layers.

The broader implication is that the modern internet has become a tightly coupled system. Each service depends on others to function, and those dependencies often overlap. This makes the system efficient, but also brittle. When one critical node fails, its effects ripple through dozens of others. Similar patterns were seen in previous cloud outages at Amazon Web Services, Microsoft Azure, and Google Cloud. Each time, the failure of a single internal component at a major provider triggered cascading failures that spanned industries.

In cybersecurity circles, this phenomenon is called the “monoculture risk.” Much like planting a single crop across vast farmland, relying on the same infrastructure everywhere means a single flaw can affect everyone at once. The convenience of centralisation has replaced true resilience with the illusion of it. Businesses often believe they are safer by outsourcing to trusted cloud providers, but they are merely shifting risk rather than removing it. The Cloudflare outage proved that even self-inflicted errors—not hostile attacks—can take down enormous portions of the web.

Impact Beyond Just Business Loss

The financial and reputational damage of the outage was significant. Many e-commerce sites lost the better part of a day’s transactions. Banks, logistics firms, and SaaS vendors saw operations grind to a halt. Studies indicate that a single hour of downtime can cost mid-sized companies hundreds of thousands of dollars, and for large enterprises, the losses can reach into the millions. Cloudflare’s own stock price dropped by more than 3% the same day.

Yet there was a more profound loss on the day—trust. Organisations had to explain to their users that their systems weren’t broken, but their provider was. For companies that pride themselves on reliability, that distinction offers little comfort.

The outage also raised questions about preparedness. Many firms discovered that their redundancy plans were purely theoretical. They had assumed that a provider of Cloudflare’s size could not fail catastrophically. When it did, few had real alternatives ready. Switching away from Cloudflare is not as simple as flipping a switch; DNS propagation delays and security dependencies make it difficult to reconfigure on the fly. Only a handful of organisations with preconfigured backup CDNs or alternate DNS providers could adapt quickly. Most had no option but to wait and watch.

From a technical perspective, the event highlights the paradox of automation. Cloudflare’s system is designed for speed—global configuration updates can be deployed within seconds. This capability is usually an asset when responding to new threats or optimising performance, but during this incident, it became the very mechanism that spread the fault worldwide. The same automation that protects the internet can also amplify human error to catastrophic levels.

Another revealing point is that Cloudflare’s safeguards were not inadequate in quantity but in expectation. Engineers had defined what they believed were safe boundaries—a file-size limit, a feature-count threshold, a controlled rollout mechanism—but none accounted for this particular interaction between a database query and the feature-file generation process. It is a familiar story in complex distributed systems: assumptions that hold under normal conditions fail under unusual ones.
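
One widely used containment pattern, offered here as an illustration of the principle rather than a description of Cloudflare’s own fix, is to validate a freshly generated configuration against its expected shape and fall back to the last-known-good version instead of crashing. A minimal sketch, with hypothetical names:

```rust
// Common containment pattern: validate a newly generated config and keep serving
// the last-known-good one if validation fails. Names and limits are illustrative.
const MAX_FEATURES: usize = 200;

struct BotConfig {
    features: Vec<String>,
}

fn validate(candidate: &BotConfig) -> Result<(), String> {
    if candidate.features.is_empty() {
        return Err("empty feature file".into());
    }
    if candidate.features.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit", candidate.features.len()));
    }
    Ok(())
}

// Swap in the new config only if it passes validation; otherwise keep the old one.
fn apply_or_keep(current: BotConfig, candidate: BotConfig) -> BotConfig {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(reason) => {
            eprintln!("rejecting new feature file: {reason}; keeping last known good");
            current
        }
    }
}

fn main() {
    let good = BotConfig { features: (0..60).map(|i| format!("f{i}")).collect() };
    let bad = BotConfig { features: (0..260).map(|i| format!("f{i}")).collect() };
    let active = apply_or_keep(good, bad);
    println!("serving {} features", active.features.len()); // still 60
}
```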

Designing Resilience for Future Outages

The lesson is not to abandon automation or scale, but to design systems that contain their own failures. Cloudflare’s incident shows why every organisation should assume that even trusted vendors can fail. Building resilience now means planning for supplier failure as an ordinary part of risk management, not as an extraordinary event. Companies are beginning to rethink how their systems depend on others, aiming to ensure that a single provider’s disruption doesn’t cascade into total paralysis.

In practical terms, this shift involves greater diversification. Businesses are weighing whether to combine multiple content delivery networks or maintain backup DNS providers to preserve control over routing. Some are moving key administrative tools and monitoring systems off the same infrastructure used for production, ensuring that they can still observe and communicate during crises.
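
What a backup route looks like in practice varies, but the underlying idea is simple: probe the primary edge and switch to a preconfigured alternative when it stops responding. The sketch below uses hypothetical hostnames and a bare TCP reachability check; a production setup would rely on proper health checks and DNS or load-balancer failover.

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

// Minimal sketch of edge failover: probe each candidate endpoint and use the
// first one that accepts a TCP connection. Hostnames here are hypothetical.
fn first_reachable(endpoints: &[&str]) -> Option<String> {
    for host in endpoints {
        // Resolve and try to open a connection to port 443 with a short timeout.
        let addrs = match (*host, 443u16).to_socket_addrs() {
            Ok(addrs) => addrs,
            Err(_) => continue,
        };
        for addr in addrs {
            if TcpStream::connect_timeout(&addr, Duration::from_secs(2)).is_ok() {
                return Some((*host).to_string());
            }
        }
    }
    None
}

fn main() {
    // Primary CDN first, then a preconfigured backup on different infrastructure.
    let candidates = ["www.primary-cdn.example", "www.backup-cdn.example"];
    match first_reachable(&candidates) {
        Some(host) => println!("routing traffic via {host}"),
        None => println!("no edge reachable; switching to static maintenance page"),
    }
}
```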

Others are testing their disaster recovery playbooks through simulation exercises, intentionally disabling dependencies to verify that failover systems truly work. A growing number are mapping not only their direct dependencies but also those of their vendors, uncovering how many hidden links connect them back to the same few infrastructure giants.
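
Such drills can be automated. A hedged sketch of the idea: swap in a client that always fails and verify that the application’s fallback path actually engages. The trait and type names here are invented for illustration and do not correspond to any particular vendor’s API.

```rust
// Sketch of a dependency "fire drill": substitute a client that always fails and
// confirm the fallback path still produces a response. Names are illustrative.
trait EdgeClient {
    fn fetch(&self, path: &str) -> Result<String, String>;
}

struct HealthyEdge;
impl EdgeClient for HealthyEdge {
    fn fetch(&self, path: &str) -> Result<String, String> {
        Ok(format!("content for {path}"))
    }
}

// Simulates the provider being down, as in a failover drill.
struct DownEdge;
impl EdgeClient for DownEdge {
    fn fetch(&self, _path: &str) -> Result<String, String> {
        Err("HTTP 500: edge unavailable".into())
    }
}

// Application logic under test: fall back to a cached copy when the edge fails.
fn serve(edge: &dyn EdgeClient, path: &str) -> String {
    edge.fetch(path).unwrap_or_else(|_| format!("cached copy of {path}"))
}

fn main() {
    assert_eq!(serve(&HealthyEdge, "/home"), "content for /home");
    // The drill: the same code path must still produce a response when the edge is down.
    assert_eq!(serve(&DownEdge, "/home"), "cached copy of /home");
    println!("failover drill passed");
}
```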

These steps are neither simple nor inexpensive. Redundancy adds complexity, and managing multiple providers requires expertise. But the Cloudflare event demonstrated that the cost of unpreparedness can be far higher. It also reignited discussion about whether the internet should become more decentralised—distributing control among many smaller, interoperable networks rather than relying on a handful of global nodes.

Ultimately, the Cloudflare outage was more than a temporary disruption. It was an architectural wake-up call about how much of the world’s digital infrastructure rests on trust in a few entities. The speed and power of today’s internet come with hidden fragility; one internal error at a single company can break communication, commerce, and creativity for millions.

The path forward is not about avoiding mistakes, which will always happen, but about containing them when they do.

Resilience does not come from scale alone. It comes from thoughtful design, distributed responsibility, and a willingness to question assumptions that once seemed safe. The organisations that internalise this will endure the next global outage with minimal harm.

As engineers often say, the only systems that survive are those built to expect failure. Designing for that reality is no longer optional; it is the only path to lasting reliability in an interconnected world.

The author is the Founder and CEO of PygmalionGlobal. He collaborates with multiple cybersecurity companies, including NPCore in South Korea, and engages with government agencies and conglomerates across Asia.