Cloudflare Admits Software Update on November 14 Caused Permanent Loss of 55% of Customer Logs
On November 14, 2024, a malfunction in Cloudflare's log service prevented logs from being delivered. Despite engineers' efforts to fix the issue, the service was disrupted for 3.5 hours, and approximately 55% of logs were permanently lost.
Log services are crucial for network services as they allow for the analysis of access data, troubleshooting, and identifying potential malicious attacks. Therefore, the failure of the log service is considered a serious issue.
In its incident report, Cloudflare acknowledges that the main cause of the malfunction was an error in a deployed software update, which prevented Cloudflare Logs from correctly sending log data to customers.
Due to the vast amount of data typically involved in logs, Cloudflare uses a tool named Logpush to divide logs into predictably sized packets. These packets are then pushed to customers at a reasonable pace for analysis.
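As a rough illustration of dividing logs into predictably sized batches, here is a minimal Python sketch. It is not Cloudflare's implementation: the function name `batch_by_size` and the `max_bytes` limit are hypothetical, chosen only to show the general idea.

```python
from typing import Iterable, Iterator, List


def batch_by_size(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group log lines into batches of at most max_bytes each (hypothetical helper).

    A single line larger than max_bytes is emitted as its own batch.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        # Count the UTF-8 size of the line plus one byte for a newline separator.
        line_size = len(line.encode("utf-8")) + 1
        # Close the current batch if adding this line would exceed the limit.
        if batch and size + line_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += line_size
    # Emit whatever is left over.
    if batch:
        yield batch
```

Because each batch has a known upper bound, a receiver can plan for the size of every delivery instead of absorbing one unbounded stream.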
On November 14, Cloudflare engineers made changes to Logpush to support an additional set of data. However, this modification had a critical flaw: they forgot to instruct tools like Logfwdr to push these logs to customers. As a result, although the logs were collected, they were not pushed to customers for storage and were permanently lost after the log caches were cleared.
Cloudflare engineers discovered the issue and rolled back the software update just five minutes after deployment. However, the rollback triggered another error in Logfwdr: log events for all customers were pushed into the system, not just those for customers who had configured Logpush scheduled jobs.
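The intended behavior, forwarding events only for customers with configured jobs, can be sketched as a simple filter. This is an illustrative assumption, not Logfwdr's actual code: `select_events`, the `customer_id` field, and the set of configured customers are all hypothetical names. The sketch treats an empty configuration as "forward nothing" rather than "forward everything".

```python
from typing import Dict, Iterable, List, Set


def select_events(
    events: Iterable[Dict[str, str]],
    configured_customers: Set[str],
) -> List[Dict[str, str]]:
    """Keep only events from customers that have a configured job (hypothetical).

    If configured_customers is empty, no events are forwarded.
    """
    return [e for e in events if e["customer_id"] in configured_customers]
```

With this design, a missing or empty configuration reduces the volume sent downstream instead of multiplying it, at the cost of delivering nothing until the configuration is restored.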
The resulting flood of logs overwhelmed the Cloudflare Logs service, and a large share of log data was lost entirely. These logs were neither pushed to customers for storage nor retained by Cloudflare's own systems.
Cloudflare has apologized for the incident and stated that it has plans in place to prevent such events from happening again, although that work is not yet complete.