In the Name of Security!

Following on from my blog entry about data protection laws, I felt it would be appropriate to cover some aspects of data security. Data analysts, engineers and scientists should know at least the basics of these concepts (or be aware that it’s something to seek help about).

As the old saying goes, “a stitch in time saves nine”. A little bit of extra work to have good security practices can go a long way.

I’ll discuss some of the key elements in more detail below.

1. Access

Within access, there are two main facets:

Authentication (AuthN) - are you who you say you are?

Authorization (AuthZ) - are you allowed to access this data?

Authentication (AuthN)

In terms of AuthN, most people would be familiar with a username and password. However, with the rise of the cloud and people generally having many (even hundreds of) different accounts, Federated Identity Management is becoming more and more popular.

Federated Identity Providers (IdP) are trusted entities that handle the AuthN for a system. It is analogous to having the ‘trusted person’ in the group handle all the identifications - if they say that a person is who they claim to be, everyone else is satisfied. Some examples of major IdPs include Okta and Google. The ‘federated’ part comes in when many separate systems agree to trust the same IdP, so that one identity works across all of them.

One ID to rule them all Photo by Porapak Apichodilok from Pexels

From the perspective of the user, it’s easier to remember 1 password than 100. From the developer’s perspective, letting a trusted IdP handle identification is much safer than doing it yourself and potentially getting it wrong. However, importantly, federated IdP only authenticate users, and do not authorise users (discussed below).

Authorization (AuthZ)

AuthZ is important in the age of cloud and data privacy - just because a person is verified to be able to access a system doesn’t mean they can access everything. The important concept is the Principle of Least Privilege (POLP) - you should only have just enough access to do the work required. Often this access is time-based (e.g. with tokens), so the credentials will expire after a set amount of time.

A good example of this in action is AWS’s data centres: no one has blanket access to all parts of the data centre. Anyone who needs access to do certain work only gets time-limited access to a specific part of the data centre. Once they are done, their access expires.
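To make the idea of scoped, time-limited access concrete, here is a minimal sketch using only the Python standard library. It is not how AWS or any IdP actually issues tokens; the token layout, the `issue_token` and `is_allowed` names and the `finance:read` scope are all assumptions made up for illustration.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

# Demo-only signing key generated at start-up. A real system would load this
# from a secret store rather than creating it in-process (assumption).
SIGNING_KEY = secrets.token_bytes(32)


def _sign(payload: str) -> str:
    """Return an HMAC-SHA256 signature for the payload as a hex string."""
    return hmac.new(SIGNING_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(user: str, scope: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a signed token that grants exactly one scope until it expires."""
    issued_at = time.time() if now is None else now
    expires_at = int(issued_at) + ttl_seconds
    payload = f"{user}|{scope}|{expires_at}"
    return f"{payload}|{_sign(payload)}"


def is_allowed(token: str, required_scope: str, now: Optional[float] = None) -> bool:
    """Allow access only if the token is untampered, in scope and not expired."""
    current = time.time() if now is None else now
    parts = token.split("|")
    if len(parts) != 4:
        return False
    user, scope, expires_at, signature = parts
    expected = _sign(f"{user}|{scope}|{expires_at}")
    # Constant-time comparison on bytes avoids leaking where the strings differ.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return scope == required_scope and current < int(expires_at)
```

A token issued with `issue_token("alice", "finance:read", 900)` would then pass `is_allowed(token, "finance:read")` for 15 minutes and fail for any other scope, or once the time is up.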

This ties into another major concept, Role-Based Access Control (RBAC). Rather than granting access directly to users, the security and access policies are attached to roles. The roles in turn are then granted to users/groups.

RBAC is common in cloud systems, such as AWS Identity and Access Management (IAM) and Windows/Azure Active Directory (AD) groups. For example, only the role ‘Security Admin’ has access to see all the audit logs of a production system. The role is granted to the ‘Security Staff Group’, which includes Steve Smith, the head of security.

However, he subsequently gets seconded to a DevOps role, so he is removed from the group. He therefore no longer holds the role and no longer has access.

RBAC is an example of fine-grained security policies - ideally you want your access to be as limited as possible. For example, even if a person works in the Finance Department, it does not necessarily mean they have read access to the entire finance database.
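The relationship between permissions, roles, groups and users can be sketched in a few lines of Python. This is a toy in-memory model, not the way IAM or Active Directory stores things; the `AccessModel` class, the `audit-logs:read` permission string and the `steve.smith` user id are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AccessModel:
    """Permissions attach to roles, roles attach to groups, users join groups."""

    role_permissions: Dict[str, Set[str]] = field(default_factory=dict)
    group_roles: Dict[str, Set[str]] = field(default_factory=dict)
    user_groups: Dict[str, Set[str]] = field(default_factory=dict)

    def permissions_for(self, user: str) -> Set[str]:
        """Collect every permission the user inherits through groups and roles."""
        permissions: Set[str] = set()
        for group in self.user_groups.get(user, set()):
            for role in self.group_roles.get(group, set()):
                permissions |= self.role_permissions.get(role, set())
        return permissions

    def can(self, user: str, permission: str) -> bool:
        """Deny by default: access exists only if some role grants it."""
        return permission in self.permissions_for(user)


model = AccessModel(
    role_permissions={"Security Admin": {"audit-logs:read"}},
    group_roles={"Security Staff Group": {"Security Admin"}},
    user_groups={"steve.smith": {"Security Staff Group"}},
)
print(model.can("steve.smith", "audit-logs:read"))  # True while he is in the group

# Secondment: remove him from the group and the access goes with it.
model.user_groups["steve.smith"].discard("Security Staff Group")
print(model.can("steve.smith", "audit-logs:read"))  # False
```

Nothing was ever granted to the user directly, so there is nothing to clean up on the user when he changes jobs.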

2. Encryption

Data can potentially be ‘snooped’ upon by hackers. Using algorithms and keys to encrypt the data means that if it is intercepted, it can’t be read without the right key.

There are two types of encryption - encryption at rest (when the data is sitting in storage, such as a database) and encryption in transit (when the data is being transmitted).

Securing Data! Photo by Pixabay from Pexels

It is important to always check whether your data is encrypted both at rest and in transit. Also consider regularly rotating your encryption keys.

Many cloud services have this feature as a toggle that can be easily turned on - for example for S3 buckets, SQS queues and Kinesis Firehose streams.
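For the in-transit half, a client can also refuse to talk over anything weaker than it is comfortable with. Below is a small sketch using Python’s standard `ssl` module; the choice of TLS 1.2 as the floor and the `strict_tls_context` helper name are my own assumptions rather than a universal rule.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server and rejects old protocols."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2 (assumed policy for this sketch).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
# The context can then be passed to a client, for example:
#   urllib.request.urlopen("https://example.com", context=context)
```

Encryption at rest is usually a service setting rather than application code, which is why it tends to be checked through configuration rather than written by hand.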

3. Securing public-facing infrastructure and assets

All the encryption, AuthN and AuthZ won’t help you if you leave the front door open. Importantly, any public-facing assets generally need to be tested to ensure they are secured properly. This is generally known as penetration testing and essentially involves hiring/contracting ‘white hat’ hackers to deliberately try to hack your infrastructure.

Don’t leave your keys in the front door! Photo by PhotoMIX Ltd. from Pexels

If the hackers know the infrastructure details in advance, this is known as ‘white box’ testing and if not it is ‘black box’ testing. Furthermore, if the organisation’s security staff don’t know the test is coming, it is known as a ‘blind’ test.

What role do Cloud Providers play?

Cloud service providers are responsible to a certain extent for cloud-based infrastructure. AWS, for example, has the Shared Responsibility Model, in which:

AWS is responsible for security of the cloud - for example, securing the physical hardware servers that power the data centres

Customer is responsible for security in the cloud - for example, configuring firewalls for your virtual machine servers

How much responsibility the cloud service provider has depends on how ‘abstract’ the service is:

Infrastructure-as-a-service (IaaS) - you are basically renting servers so you are responsible for many things such as operating system patching, network firewalls, etc.

Platform-as-a-service (PaaS) - the cloud provider takes care of the underlying infrastructure, but you need to secure access to your platform (e.g. admin account for a SQL Server) and set up encryption etc.

Software-as-a-service (SaaS) and ‘Abstract Services’ - the cloud provider takes care of most of the infrastructure, so the user’s main responsibility is access (e.g. Dynamics 365 CRM in the cloud)

Cloud! Image by Gerd Altmann from Pixabay

4. Automated Security Tests

Many of the above can be automatically tested as part of a CI/CD pipeline. That is, you can run tests to check whether encryption is on and whether keys are being rotated. For example, AWS Config has automated rules that will raise an alert when they are breached (e.g. root account keys are not deleted).
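As a rough illustration of what such a pipeline check might look like, here is a sketch that scans a resource inventory and reports violations. The `Resource` fields, the 90-day rotation limit and the resource names are hypothetical; in practice the inventory would come from your cloud provider’s API or your infrastructure-as-code definitions.

```python
from dataclasses import dataclass
from typing import List

MAX_KEY_AGE_DAYS = 90  # assumed rotation policy for this sketch


@dataclass
class Resource:
    """A simplified view of one deployed resource (hypothetical schema)."""

    name: str
    encrypted_at_rest: bool
    encrypted_in_transit: bool
    key_age_days: int


def find_violations(resources: List[Resource], max_key_age_days: int = MAX_KEY_AGE_DAYS) -> List[str]:
    """Return one human-readable message per failed security check."""
    violations: List[str] = []
    for resource in resources:
        if not resource.encrypted_at_rest:
            violations.append(f"{resource.name}: encryption at rest is off")
        if not resource.encrypted_in_transit:
            violations.append(f"{resource.name}: encryption in transit is off")
        if resource.key_age_days > max_key_age_days:
            violations.append(f"{resource.name}: key is {resource.key_age_days} days old")
    return violations


inventory = [
    Resource("orders-bucket", True, True, 30),
    Resource("events-queue", False, True, 120),
]
for message in find_violations(inventory):
    print(message)
```

A CI job could fail the build whenever `find_violations` returns a non-empty list, so a misconfiguration is caught before it reaches production.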

Some things to consider:

Secure your network with firewalls and lock down unused ports

Place more sensitive resources in private subnets which are not routable (i.e. accessible) via the Internet. Also consider accessing these resources via bastion servers.

Ensure data stores (databases, S3 buckets and network drives) are not directly publicly accessible via the Internet - ideally, expose them only through APIs with authentication, security and throttling in place

Restrict access to systems to certain whitelisted IP addresses or VPNs

Regularly update operating systems and apply security patches

Ensure all public-facing systems have secured access (e.g. SSH) and root/admin accounts are either deleted or secured with Multi-Factor Authentication

Use time-based access tokens to access systems rather than persistent access keys/passwords

Rotate access keys/passwords regularly

Ensure encryption in transit (SSL/TLS, HTTPS) is on so that logins are secured
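To illustrate the IP allowlisting item from the list above, here is a small sketch using Python’s standard `ipaddress` module. The network ranges are placeholders (a documentation range and a made-up VPN range), and in most setups this rule would live in a firewall or security group rather than in application code.

```python
import ipaddress
from typing import Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Placeholder ranges for illustration only.
ALLOWED_NETWORKS: Sequence[Network] = (
    ipaddress.ip_network("203.0.113.0/24"),  # e.g. office range (assumed)
    ipaddress.ip_network("10.8.0.0/16"),     # e.g. VPN range (assumed)
)


def is_allowed_source(address: str, allowed: Sequence[Network] = ALLOWED_NETWORKS) -> bool:
    """Return True only if the address parses and falls inside an allowed network."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    return any(ip in network for network in allowed)
```

The function denies by default: an unparseable or unlisted address simply gets `False`.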

My checklist in my prior blog also covers some aspects of best practices for security for AWS serverless resources.

5. Do not hardcode secrets and passwords

It is very tempting to do this when you want to do something quick and dirty - just put the password in your code/script! While it may save a few minutes of annoying coding, sooner or later you will inadvertently commit your code to a git repository... and then you are in big trouble.

You then make your git repo public! You panic and realise you can’t simply reverse commits, and people can still see your password in the history even if you have deleted it in your latest commit!

Moral of the story: DO NOT hardcode secrets. Good alternatives are to either:

Store them in environment variables and only pass them in during runtime

Store them encrypted using a key management service (e.g. AWS KMS) and decrypt them on read. That way, even if the encrypted password is compromised, an attacker won’t be able to decrypt it without the key.
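The environment variable option can be as simple as the sketch below. The `get_secret` helper, the `MissingSecretError` exception and the `DB_PASSWORD` variable name are hypothetical; the point is only that the secret is read at runtime and the code fails loudly if it is absent.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not supplied at runtime."""


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of from source code."""
    value = os.environ.get(name)
    if not value:
        # Fail fast rather than continuing with an empty credential.
        raise MissingSecretError(f"Environment variable {name} is not set")
    return value


# Usage (the variable name is an assumption):
#   password = get_secret("DB_PASSWORD")
```

The value is then supplied by whatever runs the code (a shell, a CI system or a container platform), and nothing sensitive ends up in the repository.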

6. Don’t keep Sensitive Data in your Code

This one is commonly adhered to in most architectural patterns - the concept of separation of concerns means your code base shouldn't have any sensitive data inside.

The business logic and processing is in your code, but the actual sensitive data resides in the data store. That way, even if your git repository or source code is compromised, at least sensitive data will not be compromised.

It is a bit trickier with algorithms and more sensitive business logic, in which case a multi-tiered/microservices architecture may help. You keep the sensitive business logic in a separate service accessible via an API endpoint.

7. Audit Logging for Access to Data

It is important to keep logs of access to data to make sure no unauthorised access occurs. As discussed above, automated security alerts can pop up if unusual/unauthorised access is detected.

A good example of this is AWS CloudTrail, which records the API calls made against the resources in your account.
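For your own applications, an audit trail can start as one structured log line per access attempt. The sketch below uses Python’s standard `logging` and `json` modules; the field names and the `log_access` helper are assumptions, and this is not CloudTrail’s record format.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("audit")


def log_access(user: str, action: str, resource: str, allowed: bool) -> str:
    """Write one JSON audit record per access attempt and return the line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    }
    line = json.dumps(record, sort_keys=True)
    # Denied attempts are logged at a higher level so they are easier to alert on.
    audit_logger.log(logging.INFO if allowed else logging.WARNING, line)
    return line
```

Because every record is JSON with the same keys, it is easy to ship the logs somewhere central and search for things like repeated denied attempts by the same user.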

Keeping those logs in those servers! Photo by panumas nikhomkhai from Pexels

8. Follow a checklist!

Cloud service providers have provided some good checklists and guides to follow - they are free of charge and good reference material. Doesn't hurt to try them!

Closing Thoughts

I like to sometimes blog about more ancillary topics related to data - it is nice to be refreshed about data protection and privacy considerations.

Hopefully this blog gives you a little bit of a flavour of data security and what to watch out for.

Lock It Up - Keeping Your Data Safe with Security Best Practices

Prevention is definitely better than the cure!