Object Storage - How Is It Built? (3/3)

Build
Rémy Léone
6 min read

In the previous articles, we published a general overview of Object Storage as well as the logical workflow of get and put operations. In this article, we will go through the infrastructure design on which our object storage service runs.

What's under the hood?

The first challenge was to find the right balance between the network, CPUs and IOPS to optimize our investment. It was also necessary to estimate the power consumption of this infrastructure to meet our physical infrastructure constraints.

As a result, we chose a large number of low-power CPUs in order to achieve a higher density. At the network level, we provisioned 200 Gb/s per rack.

A typical rack is composed of 38 servers called Storage Pods (SP), headed by
two Load Balancers (LB) and two Top-of-Rack switches (ToR). A storage pod provides 150 TB of capacity and has hot-swappable disks.

The ToRs are L3 load-balanced across two different routers. Under nominal conditions this provides 200 Gb/s of full-duplex bandwidth, and 100 Gb/s if one of the components fails.
Each LB (running HAProxy) is connected to a 100 Gb/s port on each ToR via its 40 Gb/s Network Interface Cards (NICs); the links are L2 bonded to provide each LB with 80 Gb/s under nominal conditions.
The storage pods are connected to a 25 Gb/s port on each ToR via their 10 Gb/s NICs; the links are L2 bonded to provide each SP with 20 Gb/s under nominal conditions.
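
The same arithmetic, as a minimal sketch: the link speeds below are the ones quoted above, while the aggregation logic (sum the bonded links, drop the largest one to model a single failure) is purely illustrative.

    # Minimal sketch of the rack bandwidth arithmetic described above.
    # Link counts and speeds come from the text; the rest is illustrative.

    RACK_UPLINKS_GBPS = [100, 100]  # each ToR is L3 load-balanced on a different router
    LB_LINKS_GBPS = [40, 40]        # one 40 Gb/s link per ToR, L2 bonded
    SP_LINKS_GBPS = [10, 10]        # one 10 Gb/s link per ToR, L2 bonded

    def nominal(links):
        """Aggregate bandwidth when every link is up."""
        return sum(links)

    def degraded(links):
        """Aggregate bandwidth after losing one link (or the switch behind it)."""
        return sum(links) - max(links)

    for name, links in [("rack uplink", RACK_UPLINKS_GBPS),
                        ("per LB", LB_LINKS_GBPS),
                        ("per SP", SP_LINKS_GBPS)]:
        print(f"{name}: {nominal(links)} Gb/s nominal, {degraded(links)} Gb/s degraded")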

Gateways run with 256 GB of RAM and 100 Gb/s of connectivity.

Our boot process is based on PXE servers so that the server configuration is fully idempotent. Once an image has booted, further configuration is applied using Salt.

Typical Object Storage Rack

Designed for resiliency

The second challenge we faced was the failure domain. Our goal is to avoid data unavailability caused by a faulty switch, a broken fiber, or a rack that gets disconnected.

This is why all our Object Storage racks are fully redundant at the network level. Each server has a minimum of two interfaces connected to two different switches, with an additional management interface on a dedicated switch.
We have also established a minimum design of 3 to 5 racks, which allows us to tolerate the loss of a rack.

We maximized the density of our racks by using servers that provide a capacity of nearly 150 TB in only 1U.
We could have chosen 4U servers providing nearly 1.5 PB, but we preferred to limit our failure domain.
We have a capacity of 200 Gb/s per rack, and the platform sits between the Internet edge and the rest of our infrastructure, so it is not directly impacted by any outage that could occur on the rest of our products.
Each rack also benefits from two local load balancers; each load balancer is a dual-CPU machine capable of sustaining the associated network load.
Each load balancer has two bonded network interfaces, which allows us to tolerate the failure of an interface in a way that is invisible to the software.
All servers are addressed via an ECMP IP which is declared in the A zone of our DNS.

Failure domain of a rack
Failure Domain of Availability Zone

Maximum tolerable failure domain without losing availability

Resiliency to failures

To summarize, here is a reminder of how we mitigate the different failure types:

[√] Machine: Distributed design
[√] Switch: Bonding
[√] Drive: Erasure coding
[√] Twinax/SFP: Double network attachment
[√] Load balancer: Dual load balancer per rack
[√] Network interface: Double network card
[√] Electricity: Dual electrical feed
[√] Region: We have two regions; the customer can duplicate the data

Daily life for a Storage-ops @Scaleway

We monitor the entire cluster on different aspects:

  • Hardware: We need to know at all times which of our machines are up and which are down. We obviously go much further, for example by measuring power consumption in real time or the rotation speed of our fans.
  • Network: We have graphs of the usage and status of network cards, whether ports are up, network interfaces, and so on. We also benefit from the network monitoring that we have always had: it is based on SNMP, and its metrics are sent to the Scaleway monitoring team as well.
  • Cluster/Application: We have information about the cluster at both the system and the application level. Requests are segmented by HTTP verb and status code, which allows us to check the status of the service quickly (a small sketch of this segmentation follows the list). The service logs are sent to an ELK cluster using an rsyslog agent.
  • DB synchronization: We measure the replication times between our databases. As you may know, we chose Amsterdam for the launch of this Object Storage, which was a great architectural challenge.
  • Usage: We naturally have usage graphs and fill rate graphs of the platform to ensure that you never run out of storage space.
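
As an illustration of the per-verb / per-status-code segmentation mentioned above, here is a minimal sketch; the log format and field layout are assumptions, not the actual gateway format.

    # Count requests per (HTTP verb, status code) pair from gateway access logs.
    # The log format below is purely illustrative.
    from collections import Counter

    def aggregate(log_lines):
        counts = Counter()
        for line in log_lines:
            # Assumed format: "<timestamp> <verb> <path> <status> <latency_ms>"
            _, verb, _, status, _ = line.split()
            counts[(verb, status)] += 1
        return counts

    sample = [
        "2019-05-06T10:00:00Z GET /my-bucket/object1 200 12",
        "2019-05-06T10:00:01Z PUT /my-bucket/object2 200 48",
        "2019-05-06T10:00:02Z GET /my-bucket/missing 404 3",
    ]
    for (verb, status), count in aggregate(sample).items():
        print(verb, status, count)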

Accounting

In order to have a reliable billing process, a reliable architecture must be
designed with the following constraints:

  • Count every request made by every client,
  • Evaluate the data usage in all clusters,
  • Have a small temporal delay (about one hour),
  • Be decoupled from the main service, to avoid global outages in case of a billing error and to keep the performance overhead minimal.

For each request made by a client, the gateway creates an event. This event is processed by the Scaleway accounting stack, which creates a billing event if necessary.

1. Request accounting

Each S3 request received by the gateway generates an accounting event. The accounting system increments counters which are reset regularly, and periodically a billing event is generated (for example, every 1 million GET Object requests).
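
A minimal sketch of this mechanism, assuming an in-memory counter; the actual accounting stack, thresholds and event format are internal and only approximated here.

    # Count requests per (project, verb, resource) and emit a billing event
    # when a threshold is reached; counters are reset once billed.
    from collections import defaultdict

    BILLING_THRESHOLDS = {("GET", "Object"): 1_000_000}  # e.g. one billing event per 1M GET Object

    class RequestAccounting:
        def __init__(self, emit_billing_event):
            self.counters = defaultdict(int)
            self.emit_billing_event = emit_billing_event

        def record(self, project_id, verb, resource):
            """Called for every S3 request seen by the gateway."""
            key = (project_id, verb, resource)
            self.counters[key] += 1
            threshold = BILLING_THRESHOLDS.get((verb, resource))
            if threshold and self.counters[key] >= threshold:
                self.emit_billing_event(project_id, verb, resource, self.counters[key])
                self.counters[key] = 0

    accounting = RequestAccounting(lambda *event: print("billing event:", event))
    for _ in range(1_000_000):
        accounting.record("project-42", "GET", "Object")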

2. Data usage accounting

The basic idea is to regularly pull (e.g. every 15 minutes) the data usage of
each bucket in the cluster. Then, regularly (e.g. every day), the average data
usage is aggregated by the accounting system, and a billing event is created and sent to the billing system.

The main caveat of this simple architecture is that frequent data usage pulls
could have a significant performance impact on the cluster. To avoid this, the idea is to pull only the buckets that have changed since the last pull.

To know whether a bucket has changed, the S3 Gateway sends an event to the accounting system every time a bucket is modified (PUT Object, DELETE Object…). To avoid any problem with old buckets (buckets that haven't changed in a long time), the system automatically inserts virtual bucket modification events every so often. A sketch of the whole mechanism follows.
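
Below is a hedged sketch of this "pull only changed buckets" idea; the helpers get_bucket_usage() and send_billing_event() are hypothetical, and the intervals are the examples given in the text.

    # Track buckets modified since the last pull, pull usage only for those
    # (plus stale buckets, which act as virtual modifications), then
    # aggregate the samples into billing events.
    import time

    class UsageAccounting:
        def __init__(self, get_bucket_usage, send_billing_event, refresh_after=7 * 24 * 3600):
            self.get_bucket_usage = get_bucket_usage      # queries the cluster for one bucket
            self.send_billing_event = send_billing_event  # pushes an event to the billing system
            self.refresh_after = refresh_after            # force a refresh for "old" buckets
            self.dirty = set()                            # buckets modified since the last pull
            self.last_pulled = {}                         # bucket -> timestamp of last pull
            self.samples = {}                             # bucket -> usage samples since last aggregation

        def on_bucket_modified(self, bucket):
            """Event sent by the S3 Gateway on PUT Object, DELETE Object, ..."""
            self.dirty.add(bucket)

        def pull_usage(self):
            """Runs e.g. every 15 minutes: only changed (or stale) buckets are queried."""
            now = time.time()
            stale = {b for b, t in self.last_pulled.items() if now - t > self.refresh_after}
            for bucket in self.dirty | stale:
                self.samples.setdefault(bucket, []).append(self.get_bucket_usage(bucket))
                self.last_pulled[bucket] = now
            self.dirty.clear()

        def aggregate(self):
            """Runs e.g. every day: average the samples and emit one billing event per bucket."""
            for bucket, samples in self.samples.items():
                self.send_billing_event(bucket, sum(samples) / len(samples))
            self.samples.clear()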

Bucket Uniqueness

One product requirement is that bucket names be unique across all the
clusters in every region. To meet this requirement, a central database is populated with the list of every bucket created.

The bucket DB should only be used in two scenarios:

  • During bucket creation, to ensure that the bucket name is not already used anywhere else;
  • During a GET Bucket location request, which returns the region of the bucket.

The bucket DB is only contacted by the gateway.

During the design process, the team kept in mind that in the future this
Bucket DB might not only contain the bucket names but also other metadata
such as CORS information.

We use CockroachDB for this database to ensure good replication across our different regions.
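
A hedged sketch of the two scenarios above, assuming a single buckets table holding (name, region) with a uniqueness constraint on the name; CockroachDB speaks the PostgreSQL wire protocol, so a PostgreSQL driver (psycopg2 here) is used for illustration, and the connection string and schema are not the actual ones.

    # Reserve a bucket name globally and answer "GET Bucket location" requests
    # from the central bucket DB.
    import psycopg2
    from psycopg2 import errors

    conn = psycopg2.connect("postgresql://gateway@bucket-db:26257/object_storage")  # illustrative DSN
    conn.autocommit = True

    def create_bucket(name, region):
        """Fails if the bucket name is already taken in any region."""
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO buckets (name, region) VALUES (%s, %s)",
                    (name, region),
                )
            return True
        except errors.UniqueViolation:  # enforced by the PRIMARY KEY / UNIQUE constraint on name
            return False

    def get_bucket_location(name):
        """Backs the GET Bucket location request: returns the region of the bucket."""
        with conn.cursor() as cur:
            cur.execute("SELECT region FROM buckets WHERE name = %s", (name,))
            row = cur.fetchone()
            return row[0] if row else None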

What's next?

We still have many challenges to face, today and tomorrow. We are aware that confidence in storage products is based on availability as well as durability.
The robustness of our infrastructure and our expertise on the subject are what will earn us the trust of our users.
We want to add more reliability mechanisms to ensure that even in the most challenging failure cases, the availability of customer data is not affected.

Since the S3 protocol is a standard created in 2007, we need to reach the feature level necessary to help companies migrate between cloud providers.
We need to work on our expansion into several regions to convince customers who are waiting for geo-replication features.
Advanced ACLs on buckets are also a very popular feature request from our customers.
We are also working to upgrade the rest of our infrastructure to use this technology, as well as to upgrade some storage products that we already support.

Conclusion

We believe that object storage is one of the core building blocks of a modern cloud-native application architecture.
In this series of articles we dove deep into the reasons for using object storage, the internal lifecycle of a request, and finally how we implement it at Scaleway.

Object storage is ready to use today! Create your first bucket now!
