Today, we want to explain what recently happened to our network in Amsterdam.
Between Friday the 15th and Wednesday the 20th, a large portion of our users in Amsterdam suffered from network troubles. We know that you placed your trust in our services, and an outage like this is unacceptable. We would like to apologize and assume full responsibility for the situation. We'd like to share all of the details about this event.
Last Friday, we started to receive notifications from our customers related to network performance issues.
Our engineering team started to investigate right away, but nothing stood out. The uplinks of the Amsterdam platform were nearly full, but we were not able to reproduce the dramatic performance issues being reported.
The same day, the software powering our NAT layer was upgraded to support a new IP range, as our existing IP pool had been exhausted.
Following the upgrade, two NAT servers had not been restarted; our initial assumption was that this was aggravating the bandwidth shortage. We restarted these servers to increase NAT capacity and continued to monitor the situation.
Bandwidth usage was still high, but the links were not completely full.
Tracking abusive usage to reduce bandwidth usage
When looking in depth at bandwidth usage, we saw that the traffic didn't have its usual ratio: the links were saturated in the backbone-to-Scaleway direction. Usually, servers tend to have the opposite ratio, pushing content toward the internet rather than downloading content from it.
What we found out is that several users were using tens of Gb/s of bandwidth to massively download from YouTube.
We suspended several accounts infringing our terms of service to reduce bandwidth usage and free up capacity.
We initially thought that the issue was solved, but it was not.
The true root cause: going down to Ethernet
On Wednesday, we received a strange report saying that traffic was not flowing between AWS and our Amsterdam Scaleway platform. We started to investigate and were able to reproduce the issue: UDP and TCP packets with a small payload were detected with an invalid checksum.
In fact, the root cause of the "performance issue" was far trickier than initially thought. The network issue was mainly related to an Ethernet frame padding bug on the router side that triggered invalid checksums.
The Ethernet Frame Padding Bug
To understand what happened, we need to go back to the basics of IP, Ethernet, and how data is encapsulated for transmission. The encapsulation of IP packets is defined in RFC 1042, "A Standard for the Transmission of IP Datagrams over IEEE 802 Networks". This RFC provides instructions on how to handle IP packets smaller than the minimum data field size (46 bytes) required by the Ethernet standards. It requires that "the data field should be padded (with octets of zero) to meet the Ethernet minimum frame size". This padding is not part of the IP packet and must not be included in the total length field of the IP header; it belongs purely to the link layer.
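To make the RFC's requirement concrete, here is a minimal sketch (the function name is our own, for illustration) of padding a short IP packet with zero octets up to the 46-byte Ethernet minimum, without touching the IP header's total-length field:

```python
# Sketch of RFC 1042 padding: a short IP packet is padded with zero
# octets so the Ethernet data field reaches the 46-byte minimum.
ETH_MIN_PAYLOAD = 46

def pad_for_ethernet(ip_packet: bytes) -> bytes:
    """Pad a small IP packet with zeros, per RFC 1042.

    The padding is link-layer only: the IP total-length field
    still describes the original, unpadded packet.
    """
    if len(ip_packet) >= ETH_MIN_PAYLOAD:
        return ip_packet
    return ip_packet + b"\x00" * (ETH_MIN_PAYLOAD - len(ip_packet))

# A 28-byte IP packet (20-byte IP header + 8-byte UDP header, no payload)
# must be padded with 18 zero octets.
padded = pad_for_ethernet(b"\x45" + b"\x00" * 27)
assert len(padded) == 46
assert padded[28:] == b"\x00" * 18  # padding is all zeros
```

The receiver recovers the original packet by trusting the IP total-length field and discarding everything beyond it.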
The problem we faced was originally caused by a Cisco bug where Nexus 9000 switches do not properly implement this RFC (described here by our head of network). When a Nexus 9000 removes the VLAN tag from an Ethernet frame whose payload size is below 46 bytes, it should adjust the payload by adding zero bytes. Instead, it adds random bytes, which ultimately causes small TCP and UDP packets to have an invalid checksum when it is computed by certain NICs.
The symptom: invalid TCP and UDP checksums on tiny packets
Accounting for the size of the IP/TCP/UDP headers, the only packets that match this case are those with a TCP payload below 6 bytes or a UDP payload below 18 bytes. As our software performs network address translation (NAT), the UDP and TCP checksums need to be recomputed, since they cover the source and destination IP addresses. For performance reasons, server NICs compute the checksum over the padding as well. This is valid because the checksum algorithm is designed so that additional zeros at the end of the payload do not change the result. When the RFC is respected and the padding is all zeros, the checksum is unaffected. This is where the Ethernet frame padding bug bites: the padding is not filled with zeros, and the checksum ends up invalid.
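The property at play can be demonstrated with the Internet checksum from RFC 1071 (the one's-complement sum used by TCP and UDP). The small sketch below shows that trailing zero bytes leave the checksum unchanged, while non-zero padding corrupts it; the payload values are arbitrary examples:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # odd length: conceptually pad with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

payload = b"\x12\x34\x56"
zero_padded = payload + b"\x00" * 5      # RFC-compliant padding
garbage_padded = payload + b"\xde\xad\xbe\xef\x00"  # buggy padding

# Zero padding only adds 0x0000 words to the sum: checksum is unchanged.
assert internet_checksum(zero_padded) == internet_checksum(payload)
# Random padding adds non-zero words: checksum no longer matches.
assert internet_checksum(garbage_padded) != internet_checksum(payload)
```

This is exactly why a NIC can safely checksum over RFC-compliant padding, and why the switch's random padding breaks that assumption.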
The software regression
We worked around this Cisco bug nearly two years ago by implementing a software workaround at the NAT layer, where we fill the padding with zeros. This workaround has a performance cost, and it was recently removed from our source tree because the bug was supposedly patched on the routers. In fact, it wasn't, and this caused all tiny TCP and UDP packets to have an invalid checksum.
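A hypothetical sketch of this kind of workaround (our NAT layer is not written in Python; the function name is invented for illustration): before recomputing checksums, zero out any link-layer padding that extends past the IP packet's declared total length.

```python
# Sketch: restore RFC 1042 zero padding on an Ethernet data field that
# starts at the IP header, using the IP total-length field (bytes 2-3)
# to locate where the real packet ends and the padding begins.
def zero_ethernet_padding(frame_payload: bytes) -> bytes:
    ip_total_length = (frame_payload[2] << 8) | frame_payload[3]
    if len(frame_payload) <= ip_total_length:
        return frame_payload  # no padding present
    padding_len = len(frame_payload) - ip_total_length
    return frame_payload[:ip_total_length] + b"\x00" * padding_len

# A 28-byte IP packet (total length 0x001C) padded to 46 bytes with
# garbage, as the buggy switch would emit it:
packet = bytes([0x45, 0x00, 0x00, 0x1C]) + b"\x00" * 24
frame = packet + b"\xab" * 18
cleaned = zero_ethernet_padding(frame)
assert cleaned[:28] == packet          # packet itself untouched
assert cleaned[28:] == b"\x00" * 18    # padding zeroed again
```

With the padding zeroed, the NIC's checksum-over-padding optimization becomes valid again, at the cost of touching every small frame.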
Why we failed
We have already identified several factors that contributed to this failure:
- internal communication issues between our support and engineering teams. Even with tens of reports, no appropriate reaction was triggered. From the engineering point of view, this looked like a minor, customer-specific configuration issue.
- overconfidence. We changed the software powering the NAT layer without carefully monitoring the situation afterwards. We didn't realize that this part of our infrastructure was the root cause as we placed too much trust in it.
- lack of qualitative monitoring of network traffic. The volume of traffic didn't drop significantly, but a finer analysis of the traffic would have revealed the issue.
We have identified several actions to prevent such an outage from happening again:
- add specific tests to our monitoring system to exercise this kind of network flow (small packets)
- improve our tooling to detect changes in traffic patterns
- significantly improve the way we handle user feedback on this kind of issue by increasing transparency and communication quality between our engineering and support teams
We wanted to share all the details with you today so that you can understand what truly happened. We will continue to analyse our processes to improve them.
We are committed to providing the best infrastructure on earth and will devote all our energy to preventing such issues from happening again.
Once again, the entire Scaleway team apologizes for the impact of this outage.