On April 7 at 4:35 pm UTC, Scaleway encountered a major incident in the FR-PAR-1 Availability Zone that impacted our Load Balancer product. A part of the Load Balancer infrastructure was unavailable for the duration of the incident. As a result, the Database, Kubernetes Kapsule, and IoT Hub products, which rely on Load Balancer as part of their infrastructure, were impacted as well.
The issue was identified and contained by 5:18 pm UTC by both the Network Products and Users & API teams.
The Scaleway Incident process was triggered, prompting all actors to come together and coordinate the tasks required to restore the affected services, and make sure communication would run smoothly. Having been working remotely for over a year, we did not gather in a physical room but rather in a virtual one, with several channels of communication remaining open overnight to ensure that information flowed effectively and accessibly during those crucial moments.
Restoration of the services started right away. Most of the Load Balancer services were restored by 8:34 pm UTC, and a couple of corner cases kept our Site Reliability and Product Engineers team busy until 0:04 am UTC on April 8. The duration of the main outage was almost 4 hours, and the overall incident lasted for 7 hours and 29 minutes.
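The two durations quoted above can be checked from the timestamps given in this post. The snippet below is only an arithmetic sketch; the variable names are illustrative and the timestamps are taken from the text.

```python
from datetime import datetime, timedelta, timezone

# Timestamps quoted in this post, all in UTC.
incident_start = datetime(2021, 4, 7, 16, 35, tzinfo=timezone.utc)
main_restored = datetime(2021, 4, 7, 20, 34, tzinfo=timezone.utc)
fully_resolved = datetime(2021, 4, 8, 0, 4, tzinfo=timezone.utc)


def hours_minutes(delta: timedelta) -> tuple:
    """Convert a timedelta into an (hours, minutes) pair."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


main_outage = hours_minutes(main_restored - incident_start)
overall = hours_minutes(fully_resolved - incident_start)

print(main_outage)  # (3, 59): "almost 4 hours"
print(overall)      # (7, 29): "7 hours and 29 minutes"
```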
We want to apologize for any inconvenience caused by the outage, and we thank you for your patience during this unavailability period. Your data was secure and protected at all times, and no data loss occurred.
This blog post aims to explain the details of the impact, the root cause of the incident, the steps we took to resolve it, and the measures taken to prevent a similar problem from happening in the future.
The incident impacted several Scaleway products.
For Load Balancer, only 1574 Load Balancer instances belonging to 775 customer organizations were disabled before we contained the issue and started the recovery procedure. These Load Balancers and their backends were inaccessible for the duration of the incident. After the incident, all Load Balancer resources were restored to normal.
For Database, up to 50 Load Balancers were impacted, and access to the corresponding Databases (up to 500) was lost. During the incident, the data was completely safe but was not available. During the restoration period, Database creation and “Allowed IPs” updates were impossible. Backups remained available and exportable at all times.
For Kubernetes Kapsule, 229 clusters belonging to 207 customer organizations were impacted. Kapsule leverages Load Balancers as part of its infrastructure between the cluster nodes and the control plane. During the incident, customers who had deactivated the Auto Healing feature were unable to contact their control planes, but their instances and the services running on cluster nodes were still available. Customers using the Auto Healing feature lost their service because the control plane started creating new nodes but could not reach them while the Load Balancers were unavailable.
For IoT, 33 customers were impacted. As IoT Hub relies on Load Balancer, Database, and Kubernetes Kapsule, the service was unavailable during the incident. After the incident, all IoT Hub customer resources were restored to normal.
Root cause and resolution of the issue
The incident was caused by a manual call to the Load Balancer Trust and Safety (T&S) API requesting the deletion of a malicious user's resources. This specific call was not part of any usual workflow; it consisted of a crafted request that was supposed to return an error. Unfortunately, a bug in the API, introduced earlier during the implementation of the Projects feature, caused the safety checks to be bypassed and triggered an avalanche of Load Balancer instance invalidations.
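This post does not describe the internals of the T&S API, so the following is only a generic sketch of the kind of safety check involved: a bulk deletion request that fails with an error when its scope is missing, or when it would affect more resources than expected, rather than falling through to "match everything". All names here (`LoadBalancer`, `select_for_deletion`, `max_affected`) are hypothetical and are not part of any Scaleway API.

```python
from dataclasses import dataclass
from typing import List, Optional


class UnsafeDeletionError(Exception):
    """Raised when a bulk deletion request fails a safety check."""


@dataclass(frozen=True)
class LoadBalancer:
    # Hypothetical, simplified resource record.
    id: str
    organization_id: str
    project_id: str


def select_for_deletion(
    inventory: List[LoadBalancer],
    organization_id: str,
    project_id: Optional[str] = None,
    max_affected: int = 50,
) -> List[LoadBalancer]:
    """Return the resources a deletion request targets, or raise if it looks unsafe."""
    # A missing or empty owner must be an error, never "match everything".
    if not organization_id:
        raise UnsafeDeletionError("organization_id is required")
    # None means "all projects of this organization"; an empty string is rejected.
    if project_id is not None and not project_id:
        raise UnsafeDeletionError("project_id must not be empty")

    matches = [
        lb
        for lb in inventory
        if lb.organization_id == organization_id
        and (project_id is None or lb.project_id == project_id)
    ]

    # Cap the blast radius: refuse requests that touch more resources than expected.
    if len(matches) > max_affected:
        raise UnsafeDeletionError(
            f"request targets {len(matches)} resources, limit is {max_affected}"
        )
    return matches


inventory = [
    LoadBalancer("lb-1", "org-a", "proj-1"),
    LoadBalancer("lb-2", "org-a", "proj-2"),
    LoadBalancer("lb-3", "org-b", "proj-3"),
]
selected = select_for_deletion(inventory, "org-a")
print([lb.id for lb in selected])  # ['lb-1', 'lb-2']
```

The point of the cap is that even if a scope check is bypassed by a later change, a single request cannot silently invalidate a large share of the fleet.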
The incident timeline can be found on the Scaleway Status page.
The call was made on April 7 at 4:35 pm UTC, and alarms were triggered in our internal monitoring channel at 4:45 pm UTC.
The team immediately started the containment and recovery procedure.
| Time | Event |
|---|---|
| April 7, 2021 at 04:53 pm UTC | The Load Balancer API was put into Read-Only mode to avoid any further customer operations. |
| April 7, 2021 at 05:20 pm UTC | The Load Balancer configurations were restored from our internal database backup made one hour earlier. No data was lost as load balancers are stateless. |
| April 7, 2021 at 05:54 pm UTC | The Load Balancer instance healing process was launched. |
| April 7, 2021 at 08:10 pm UTC | 1400 instances were successfully healed, 110 were still failing and required manual healing. |
| April 7, 2021 at 08:35 pm UTC | All Load Balancer instances were successfully restored. Some corner cases still had to be investigated. |
| April 7, 2021 at 10:12 pm UTC | The Database and IoT Hub services were back to normal. Some edge cases with a couple of Load Balancers and Kubernetes Kapsule were still being addressed: custom TLS certificates were not available and had to be restored from a secure certificate store; backend liveness detection failed because the Load Balancer IPs had changed and the backend servers filtered traffic by IP. |
| April 8, 2021 at 00:04 am UTC | All Load Balancer instances were restored and fixed. All services were back to normal. |
How we will prevent it from happening again
After analyzing the incident, the following measures were immediately taken:
- The Load Balancer T&S API bug was fixed and additional tests were immediately added to the test suites.
- The T&S API test procedure was updated with additional inter-team checks and reviews.
- Kubernetes Kapsule now checks the Load Balancer state before starting auto-healing.
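The last measure above can be illustrated with a small decision function. This is a minimal sketch under stated assumptions, not the Kapsule implementation: the `LoadBalancerState` enum and the `should_replace_node` helper are hypothetical names. The idea is that an unreachable node is only replaced when the Load Balancer in front of it is healthy, since otherwise the node itself may be fine and only the path to it is broken.

```python
from enum import Enum


class LoadBalancerState(Enum):
    # Hypothetical states; a real API may expose more of them.
    READY = "ready"
    UNAVAILABLE = "unavailable"


def should_replace_node(node_reachable: bool, lb_state: LoadBalancerState) -> bool:
    """Decide whether auto-healing should replace a node."""
    if node_reachable:
        # Healthy node: nothing to heal.
        return False
    # Unreachable node: only heal when the Load Balancer is healthy, so that
    # a Load Balancer outage is not mistaken for a node failure.
    return lb_state is LoadBalancerState.READY


print(should_replace_node(False, LoadBalancerState.UNAVAILABLE))  # False
print(should_replace_node(False, LoadBalancerState.READY))        # True
```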
And the following will be implemented in the near future:
- Improve the T&S API implementation guidelines.
- Improve the Load Balancer T&S API test coverage and leverage coverage analysis tools.
- Develop and deploy tools to improve and accelerate the overall Load Balancer recovery procedure.
- The Database product, as part of its continuous performance improvement, is currently re-modeling its Load Balancer infrastructure to be less prone to failures.
In writing this blog post, we wanted to provide our users with a detailed understanding of the incident and how we detected, contained, and addressed it.
The problem was caused by a bug in our APIs that we were able to detect and fix quickly. Apart from fixing the issue itself, we identified several areas for improvement during the process. We have already implemented measures to make our services more robust and will continue to improve and expand upon these in the near future.
Customer data was secure and protected at all times, and no data loss occurred. The performance and reliability of our products are of utmost importance to us, and we are continuously working on improving our services.
We hope this communication was useful and helped you understand how we manage incidents at Scaleway. As usual, we are open to your feedback. Don't hesitate to contact us on the Slack community!