OpenAI Releases December 11 ChatGPT Outage Report: A Circular Dependency Locked Engineers Out
On December 11, OpenAI's ChatGPT and Sora services suffered a significant outage lasting 4 hours and 10 minutes. The disruption was triggered by a minor change that was identified just three minutes after deployment, yet proved hard to fix because of an unexpected complication: the outage itself locked engineers out of the control plane they needed in order to roll the change back.
OpenAI's Backend Service Architecture:
OpenAI's backend services run on hundreds of Kubernetes (K8s) clusters around the world. Each cluster has a control plane, which handles cluster management, and a data plane, which actually serves traffic to users.
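As a rough illustration of that split, reading cluster state (node and pod inventory, service endpoints) goes through the control plane's API server, while user requests are handled by the workloads themselves on the data plane. A minimal sketch using the official Python Kubernetes client (illustrative only, not OpenAI's tooling):

```python
# Minimal sketch of the control-plane/data-plane split: queries about cluster
# state go through the API server (control plane), while user-facing traffic
# is served by the pods themselves (data plane). Illustrative only.
from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()
for node in v1.list_node().items:    # a control-plane read: cluster inventory
    print(node.metadata.name)
```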
Incident Overview:
At 3:12 PM PST on December 11, engineers rolled out a new telemetry service to collect metrics from the K8s control plane. The broad scope of this service inadvertently caused every node across each cluster to perform resource-intensive K8s API operations.
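OpenAI has not published the telemetry code itself, but the failure mode it describes resembles a per-node agent issuing broad, uncached LIST calls against the API server. A hypothetical sketch of that anti-pattern (names and intervals are illustrative):

```python
# Hypothetical sketch of the anti-pattern: a telemetry agent running on every
# node that polls the API server with unscoped LIST calls. With thousands of
# nodes doing this at once, each poll is a full-cluster read multiplied by N.
import time
from kubernetes import client, config

def emit_metrics(pod_count: int, endpoint_count: int) -> None:
    print(f"pods={pod_count} endpoints={endpoint_count}")  # placeholder sink

def collect_control_plane_metrics() -> None:
    config.load_incluster_config()   # the agent runs as a pod on each node
    core = client.CoreV1Api()
    while True:
        pods = core.list_pod_for_all_namespaces()             # expensive: every pod in the cluster
        endpoints = core.list_endpoints_for_all_namespaces()  # expensive: every endpoints object
        emit_metrics(len(pods.items), len(endpoints.items))
        time.sleep(15)               # a short poll interval amplifies the load

if __name__ == "__main__":
    collect_control_plane_metrics()
```

The usual remedy for this pattern is to scope queries with label or field selectors, or to use watches backed by a shared cache, so the cost no longer scales with the number of nodes times the size of the cluster.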
The simultaneous execution of these operations by thousands of nodes overwhelmed the API servers, crashing them and effectively taking down the K8s control plane in most clusters. The data plane can largely keep running without the control plane, but DNS-based service discovery depends on it, so services gradually lost the ability to find and communicate with one another.
With the API servers overloaded, DNS-based service discovery broke down, leading to a connectivity blackout. So why did a problem identified within minutes take hours to resolve? Rolling back required access to the K8s control plane to remove the faulty service, but with the control plane down, engineers were stuck in a catch-22: the component they needed to reach in order to fix the outage was the very component the outage had taken down.
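To make the DNS dependency concrete: in-cluster clients resolve service names through cluster DNS, which keeps its records current by watching the API server. Once the control plane is down and cached records age out, name resolution starts failing even though the backend pods are still running. A small sketch of what that looks like from a client's point of view (the service name is a placeholder):

```python
# Sketch of DNS-based service discovery failing: the backends may be healthy,
# but without the control plane, cluster DNS can no longer serve fresh records
# and lookups eventually fail. The service name below is illustrative.
import socket

SERVICE_NAME = "payments.default.svc.cluster.local"  # hypothetical in-cluster service

def call_dependency(host: str, port: int = 443) -> None:
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        print(f"resolved {host} -> {infos[0][4][0]}")
    except socket.gaierror as err:
        # This is the connectivity blackout as seen by a workload: the name
        # simply stops resolving, so the service cannot reach its dependency.
        print(f"service discovery failed for {host}: {err}")

if __name__ == "__main__":
    call_dependency(SERVICE_NAME)
```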
Resolution Tactics:
OpenAI engineers explored various methods to rapidly restore cluster functionality, including reducing cluster size to decrease API load, blocking access to the K8s management API to allow server recovery, and expanding K8s API server capacity to handle the influx of requests.
These efforts, carried out concurrently, eventually enabled engineers to regain control, reconnect to the K8s control plane, and roll back the problematic service change, thereby gradually restoring the clusters.
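OpenAI has not detailed the exact mechanics, but the second tactic above, blocking client access so the API servers could recover, can be approximated at the host-firewall level: drop traffic to the API server port (6443 by default) except from an operator allowlist. A hypothetical sketch; the CIDR is a placeholder:

```python
# Hypothetical sketch of shielding an overloaded API server: allow an operator
# network through, then drop all other traffic to the API server port so the
# server can recover. The allowlist CIDR is a placeholder, not OpenAI's setup.
import subprocess

OPERATOR_CIDR = "10.0.0.0/24"   # illustrative allowlist for on-call engineers
APISERVER_PORT = "6443"         # default kube-apiserver port

def shield_apiserver() -> None:
    subprocess.run(
        ["iptables", "-I", "INPUT", "-p", "tcp", "--dport", APISERVER_PORT,
         "-s", OPERATOR_CIDR, "-j", "ACCEPT"],
        check=True,
    )
    subprocess.run(
        ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", APISERVER_PORT,
         "-j", "DROP"],
        check=True,
    )

if __name__ == "__main__":
    shield_apiserver()   # requires root on the API server host
```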
During the recovery process, engineers redirected traffic to clusters that had been restored or to new, healthy clusters to further reduce load on the affected ones. However, the simultaneous attempt by many services to download resources led to saturated resource limits and required additional manual intervention, prolonging the recovery time for some clusters.
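One standard defence against that kind of thundering herd is to stagger startup work with jittered exponential backoff, so recovering services spread their downloads over time instead of hitting shared limits simultaneously. A generic sketch (the URL and function names are illustrative):

```python
# Generic sketch of jittered exponential backoff for startup downloads, so
# thousands of recovering services do not fetch resources at the same instant.
import random
import time
import urllib.request

def fetch_with_backoff(url: str, max_attempts: int = 6) -> bytes:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except OSError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random fraction of the current delay so
            # retries from many instances are spread out rather than aligned.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 60.0)
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    payload = fetch_with_backoff("https://example.com/startup-config")  # placeholder URL
    print(f"fetched {len(payload)} bytes")
```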
This incident likely gave OpenAI valuable lessons in breaking this kind of lockout, so that similar failures can be resolved more quickly in the future, without engineers being locked out of their own infrastructure.
via OpenAI Status