OpenAI Releases December 11 ChatGPT Outage Report: A Circular Dependency Locked Engineers Out
On December 11, OpenAI's ChatGPT and Sora services suffered a significant outage lasting 4 hours and 10 minutes. The disruption was triggered by a minor change that was identified just three minutes after deployment, yet proved hard to fix because of an unexpected complication: the outage itself locked engineers out of the control plane they needed in order to roll the change back.
OpenAI's Backend Service Architecture:
OpenAI's backend services run on hundreds of Kubernetes (K8s) clusters around the world. Each cluster has a control plane, which handles cluster management, and a data plane, which actually serves traffic to users.
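As a rough illustration of that split, reading cluster state (node and pod inventory, service endpoints) goes through the control plane's API server, while user requests are handled by the workloads themselves on the data plane. A minimal sketch using the official Python Kubernetes client (illustrative only, not OpenAI's tooling):

```python
# Minimal sketch of the control-plane/data-plane split: queries about cluster
# state go through the API server (control plane), while user-facing traffic
# is served by the pods themselves (data plane). Illustrative only.
from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()
for node in v1.list_node().items:    # a control-plane read: cluster inventory
    print(node.metadata.name)
```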
Incident Overview:
At 3:12 PM PST on December 11, engineers rolled out a new telemetry service to collect metrics from the K8s control plane. The broad scope of this service inadvertently caused every node across each cluster to perform resource-intensive K8s API operations.
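OpenAI has not published the telemetry code itself, but the failure mode it describes resembles a per-node agent issuing broad, uncached LIST calls against the API server. A hypothetical sketch of that anti-pattern (names and intervals are illustrative):

```python
# Hypothetical sketch of the anti-pattern: a telemetry agent running on every
# node that polls the API server with unscoped LIST calls. With thousands of
# nodes doing this at once, each poll is a full-cluster read multiplied by N.
import time
from kubernetes import client, config

def emit_metrics(pod_count: int, endpoint_count: int) -> None:
    print(f"pods={pod_count} endpoints={endpoint_count}")  # placeholder sink

def collect_control_plane_metrics() -> None:
    config.load_incluster_config()   # the agent runs as a pod on each node
    core = client.CoreV1Api()
    while True:
        pods = core.list_pod_for_all_namespaces()             # expensive: every pod in the cluster
        endpoints = core.list_endpoints_for_all_namespaces()  # expensive: every endpoints object
        emit_metrics(len(pods.items), len(endpoints.items))
        time.sleep(15)               # a short poll interval amplifies the load

if __name__ == "__main__":
    collect_control_plane_metrics()
```

The usual remedy for this pattern is to scope queries with label or field selectors, or to use watches backed by a shared cache, so the cost no longer scales with the number of nodes times the size of the cluster.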
The simultaneous execution of these operations by thousands of nodes overwhelmed the API servers, crashing them and effectively taking down the K8s control plane in most clusters. The data plane can largely keep running without the control plane, but DNS-based service discovery depends on it, so services gradually lost the ability to find and communicate with one another.
With the API servers overloaded, DNS-based service discovery broke down, leading to a connectivity blackout. So why did a problem identified within minutes take hours to resolve? Rolling back required access to the K8s control plane to remove the faulty service, but with the control plane down, engineers were stuck in a catch-22: the component they needed to reach in order to fix the outage was the very component the outage had taken down.
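To make the DNS dependency concrete: in-cluster clients resolve service names through cluster DNS, which keeps its records current by watching the API server. Once the control plane is down and cached records age out, name resolution starts failing even though the backend pods are still running. A small sketch of what that looks like from a client's point of view (the service name is a placeholder):

```python
# Sketch of DNS-based service discovery failing: the backends may be healthy,
# but without the control plane, cluster DNS can no longer serve fresh records
# and lookups eventually fail. The service name below is illustrative.
import socket

SERVICE_NAME = "payments.default.svc.cluster.local"  # hypothetical in-cluster service

def call_dependency(host: str, port: int = 443) -> None:
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        print(f"resolved {host} -> {infos[0][4][0]}")
    except socket.gaierror as err:
        # This is the connectivity blackout as seen by a workload: the name
        # simply stops resolving, so the service cannot reach its dependency.
        print(f"service discovery failed for {host}: {err}")

if __name__ == "__main__":
    call_dependency(SERVICE_NAME)
```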
Resolution Tactics:
OpenAI engineers explored various methods to rapidly restore cluster functionality, including reducing cluster size to decrease API load, blocking access to the K8s management API to allow server recovery, and expanding K8s API server capacity to handle the influx of requests.
These efforts, carried out concurrently, eventually enabled engineers to regain control, reconnect to the K8s control plane, and roll back the problematic service change, thereby gradually restoring the clusters.
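OpenAI has not detailed the exact mechanics, but the second tactic above, blocking client access so the API servers could recover, can be approximated at the host-firewall level: drop traffic to the API server port (6443 by default) except from an operator allowlist. A hypothetical sketch; the CIDR is a placeholder:

```python
# Hypothetical sketch of shielding an overloaded API server: allow an operator
# network through, then drop all other traffic to the API server port so the
# server can recover. The allowlist CIDR is a placeholder, not OpenAI's setup.
import subprocess

OPERATOR_CIDR = "10.0.0.0/24"   # illustrative allowlist for on-call engineers
APISERVER_PORT = "6443"         # default kube-apiserver port

def shield_apiserver() -> None:
    subprocess.run(
        ["iptables", "-I", "INPUT", "-p", "tcp", "--dport", APISERVER_PORT,
         "-s", OPERATOR_CIDR, "-j", "ACCEPT"],
        check=True,
    )
    subprocess.run(
        ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", APISERVER_PORT,
         "-j", "DROP"],
        check=True,
    )

if __name__ == "__main__":
    shield_apiserver()   # requires root on the API server host
```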
During the recovery process, engineers redirected traffic to clusters that had been restored or to new, healthy clusters to further reduce load on the affected ones. However, the simultaneous attempt by many services to download resources led to saturated resource limits and required additional manual intervention, prolonging the recovery time for some clusters.
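One standard defence against that kind of thundering herd is to stagger startup work with jittered exponential backoff, so recovering services spread their downloads over time instead of hitting shared limits simultaneously. A generic sketch (the URL and function names are illustrative):

```python
# Generic sketch of jittered exponential backoff for startup downloads, so
# thousands of recovering services do not fetch resources at the same instant.
import random
import time
import urllib.request

def fetch_with_backoff(url: str, max_attempts: int = 6) -> bytes:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except OSError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random fraction of the current delay so
            # retries from many instances are spread out rather than aligned.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 60.0)
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    payload = fetch_with_backoff("https://example.com/startup-config")  # placeholder URL
    print(f"fetched {len(payload)} bytes")
```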
This incident likely gave OpenAI valuable lessons in breaking this kind of lockout, so that similar failures can be resolved more quickly in the future, without engineers being locked out of their own infrastructure.
via OpenAI Status