In episode 4, we detailed some of the main security issues that are found in software, and it is no surprise that information leakage is one of the most frequent security flaws.
Laws and regulation
Data regulation exists around the world, with various laws, restrictions and agreements. Each country or group of countries inevitably starts making its own rules when it comes to personal information.
Case in point, we have the GDPR (General Data Protection Regulation) in Europe, the LGPD (Lei Geral de Proteção de Dados) in Brazil, the PDP (Protección de Datos Personales) in Argentina, and the PPA (the Privacy Protection Authority, formerly ILITA) in Israel. Of course, there are more, each with different levels of personal data protection, regulation, and authorization.
When it comes to businesses working across borders, data regulation since the creation of the GDPR in 2016 has become complicated to say the least. A lot of questions have been raised, especially when it comes to data pipelines, anonymization, data mining, and machine learning.
There are many data protection rules, but if we want to do the best we can to respect them, it really comes down to two very simple concepts:
- Anonymization: All personal information should be anonymized if possible. Of course, it is impossible to anonymize something like customer billing information, but when it comes to more generic data, anonymization is rule number one.
- Justification: Gathering data may be important for your business, but is all data relevant? Unfortunately, the answer is often no. Keeping customer data because "it might be useful one day" is not a good enough reason, and is not compliant with regulation laws such as the GDPR. Know your data, know why you need it, and document your justification.
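To make the anonymization rule above a little more concrete, here is a minimal sketch of one common technique: replacing a direct identifier with a keyed hash (HMAC). Note that this is strictly pseudonymization rather than full anonymization, since anyone holding the key could still link records; the key name and record fields here are purely illustrative.

```python
import hmac
import hashlib

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Caveat: this is pseudonymization, not full anonymization --
    anyone holding secret_key can link records belonging to the
    same person, so the key itself must be tightly protected.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical key: in practice this would live in a secret store, not in code.
key = b"rotate-me-and-keep-me-in-a-vault"

record = {"email": "alice@example.com", "country": "FR", "plan": "pro"}
safe_record = {**record, "email": pseudonymize(record["email"], key)}

# The pseudonym is stable, so analytics can still group events per user
# without ever seeing the real email address.
assert safe_record["email"] == pseudonymize("alice@example.com", key)
```

The generic fields (country, plan) stay usable for analytics, while the identifier is no longer readable without the key.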
Of course, these rules might be "simple" to understand, but implementing them can very quickly become a nightmare and raise a lot of questions depending on the data processing you need to carry out.
Let's take an example
With the GDPR, any customer can ask for their personal information to be removed from any database at any time. Now, let's imagine that you have a customer's personal information in one database, and anonymized additional data in another.
If you don't have an association table, you cannot remove the anonymized data, but then, do you actually need to remove it if you cannot trace it to its original owner?
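One way to keep erasure requests tractable is to make the association table explicit, so that deleting it severs the link while the anonymized data survives. Here is a small illustrative sketch of the two-database setup described above, using an in-memory SQLite database; the table and column names are hypothetical.

```python
import sqlite3

# In-memory sketch: personal data plus an association table on one side,
# anonymized analytics rows (keyed only by pseudonym) on the other.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers  (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE pseudonyms (customer_id INTEGER, pseudonym TEXT);
    CREATE TABLE analytics  (pseudonym TEXT, page_views INTEGER);
""")
db.execute("INSERT INTO customers VALUES (1, 'alice@example.com')")
db.execute("INSERT INTO pseudonyms VALUES (1, 'p-7f3a')")
db.execute("INSERT INTO analytics VALUES ('p-7f3a', 42)")

def erase_customer(customer_id: int) -> None:
    """Handle an erasure request: drop the personal data and the link.

    The analytics row survives, but once the association table entry is
    gone it can no longer be traced back to a person.
    """
    db.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
    db.execute("DELETE FROM pseudonyms WHERE customer_id = ?", (customer_id,))
    db.commit()

erase_customer(1)
```

After `erase_customer(1)`, the analytics row still exists but is unlinkable, which matches the "good enough" compliance position discussed below.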
Let's go further, and imagine that this data is used in the training datasets of a machine learning algorithm which takes days to run. Are you supposed to delete this specific data entry from the training datasets, meaning that you will need to re-train your entire machine learning model?
Fortunately, we are given some latitude here since complying with everything at once is impossible for some businesses, and it is often considered that if data cannot be traced back to its original owner, this is "good enough" in terms of compliance.
But will it stay that way? What other regulations lie ahead of us, and what solutions will we need to address them?
If you want to have a look at data regulation around the world, the CNIL (Commission Nationale de l'Informatique et des Libertés) provides a world map showing the different degrees of data regulation around the world.
Going back to the technical side of things, Veracode issued a whitepaper covering the biggest data breaches of 2020. The number of customers and companies impacted is beyond imaginable, and most of them are barely known to the public.
It also shows that giant companies such as Microsoft or Nintendo are not immune to data breaches and security flaws, and that from small businesses to IT Giants, security should be, more than ever, everyone's concern.
The biggest breaches expose personal information publicly, ranging from personal communications to account credentials, and add up to billions of records over the course of 2020.
The data reveals that information leakage, CRLF injection, cryptographic issues, and code quality are the most common security vulnerabilities plaguing applications today. Fortunately, we know that through secure coding best practices, educational training, and the right combination of testing tools and integrations, developers are able to write more secure code from the start — which means producing innovative applications that avoid cyberattacks and reduce the risk of costly breaches.
Source: Veracode, The Biggest Data Breaches of 2020
Kubernetes and data management
How can data be managed and protected in a Kubernetes environment?
Kubernetes is often described as stateless, meaning that it is not meant to host persistent data directly on the nodes. This is only logical, since nodes managed by Kubernetes can be auto-healed (i.e. replaced) automatically to ensure the health of the cluster, and can even be created or deleted thanks to the node auto-scaling feature.
Scalability, cost control, and constant cluster health checks come at a price, and this price is statelessness.
Persistent volumes and encryption
Statelessness does not mean that data cannot be stored while using a Kubernetes cluster, just that using the local filesystem of your cluster nodes is not the way to do it. That is where persistent volumes come into play, allowing you to use remote storage (block storage with RWO access, or NFS with RWX access) and mount it in your pods.
Persistent volumes can be used to store any kind of data, and even used as the storage system of a database, managed within a Kubernetes cluster. And as with any Kubernetes object, access to persistent volumes can (and should) be restricted.
Additionally, it is always good practice to encrypt the data you store. Most cloud providers' CSI (Container Storage Interface) drivers support encryption of persistent volumes.
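As a sketch of what this looks like in practice, here is a StorageClass enabling volume encryption, paired with a claim that uses it. This example assumes the AWS EBS CSI driver (`ebs.csi.aws.com`); the provisioner name and parameters differ per cloud provider, and the names `encrypted-gp3` and `app-data` are made up for illustration.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com        # AWS EBS CSI driver (assumed here)
parameters:
  type: gp3
  encrypted: "true"                 # volumes provisioned from this class are encrypted at rest
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]    # RWO: block storage, one node at a time
  storageClassName: encrypted-gp3
  resources:
    requests:
      storage: 10Gi
```

Any pod mounting `app-data` then writes to an encrypted volume without the application having to know about it.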
Different policies for different data
In software development, we can identify at least three types of data, each with its own specificities and requirements, and each of which should be treated according to its nature and purpose. Using the same data storage for all types of data does not make sense: some data will be read, written, or updated often, some needs specific indexing, and some is needed only occasionally or for a handful of cases.
Software data

99% of software needs a database to store the information it displays, processes, or digests. In most cases, this data is stored in a relational database, separate from the production environment. Using a managed database-as-a-service allows organizations to rely on their cloud provider for database security, architecture redundancy, and backup policies.
Managing such databases can also be done within a Kubernetes cluster, relying on persistent volumes for data storage. However, redundancy and backup policies are not necessarily defined by default, and remain the responsibility of the customer. In the end, this solution has some drawbacks in common with a self-managed solution.
The data needed for software to run is not to be considered critical. Of course, losing it would be painful, as you would need to recreate all your datasets, but if you have a backup, even a week-old one, it will probably be enough to save face with your customers.
Also, as long as no personal or confidential information has been leaked, your production environment and overall business will be as good as new in no time.
User data

Personal information about customers is essential. How could you even prepare an invoice without this information? From the name of your customer, their location, to their credentials and credit card number, you need personal data. This data is critical and valuable, and it is the first information type concerned by data regulations such as the GDPR.
A leak of your users' personal data can have disastrous consequences: you stand to lose not only information, but also customers, trust, and your good reputation. What's more, depending on the country, data regulation officers might come knocking on your door.
Often, user data is not stored any differently than software data, and the two most likely share the same database and the same access rights. They should not.
User data and software data have very different degrees of criticality, which is why each should benefit from its own access restrictions and security policies. Also, if there is one data type that needs encryption, it is user data.
Analytics data

Metrics, statistics, aggregated data, sensor readings, data specific to each business... all of this is valuable data that can be processed in multiple ways.
From simple statistics to data transformation pipelines, and even machine learning algorithms, we transform our data through external services, one pipeline at a time, one third party after the other.
This data can be the added value you have over your competitors, and that is what makes it extremely important. It often requires dedicated storage because of its volume, format, and specific querying requirements (e.g. a search engine needs to read data efficiently, whereas write performance matters much less).
Analytics data can be recomputed from a single source of truth: a database where raw anonymized data is stored. If the analytics data is lost, it can always be recomputed, with downtime of course, but downtime should never be critical.
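The "recompute from the source of truth" idea can be sketched in a few lines: aggregates are derived data, so losing them only costs the time it takes to rebuild them from the raw anonymized records. The event shape and function name below are illustrative, not a prescribed schema.

```python
from collections import Counter

# Raw anonymized events: the single source of truth.
# (Pseudonyms instead of identities, as discussed earlier.)
raw_events = [
    {"pseudonym": "p-7f3a", "page": "/pricing"},
    {"pseudonym": "p-7f3a", "page": "/docs"},
    {"pseudonym": "p-91c2", "page": "/pricing"},
]

def recompute_page_views(events):
    """Rebuild the aggregated view from scratch: slow, but always possible
    as long as the raw events survive."""
    return Counter(event["page"] for event in events)

page_views = recompute_page_views(raw_events)
```

If the analytics store is lost, rerunning this over the raw events restores it; the only cost is downtime, never data.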
Using persistent volumes to store such data in a Kubernetes cluster can be an interesting solution, as it allows the use of a pay-as-you-go (pay-as-you-grow?) model.
A true story - just for "fun"
To end on a real-life example, here is a little story about a breached database.
Once upon a time, a company was using Redis for caching data, installed on a dedicated server. The server was not well protected, nor was the Redis service. An attacker breached the server, deleted all the data stored in Redis, and inserted one piece of data to replace it all: "Not well protected, be careful next time". Two days later, the server was transformed into a fortress.
Fortunately, no critical data was stored in Redis...