Privacy, privacy, privacy

In recent years, privacy rights and data protection have become a hot topic, and legislation and regulations have stepped up to match.

As part of setting up this website with Google Analytics, I found that the Google Analytics Terms of Service (clause 7) actually require you to have a privacy policy set up. Creating my own privacy policy sparked an interest in how data privacy and protection interact with the technical aspects of data and IT work.

This blog post will explore how these laws affect day-to-day data work and what a data professional needs to know to avoid getting on the wrong side of the law. Unfortunately, I need to give the usual legal disclaimer before I continue:

I do not hold myself out to be a lawyer and the below information does not constitute legal advice. It is simply my take on how the law applies to data-specific jobs.

You cannot run from the law! Photo by CQF-Avocat from Pexels

Privacy, privacy, privacy
Summary of Data Protection and Privacy Laws
- What I’ll Cover
- What are the Data Protection Rules/Laws
What is ‘Personal Data’
- Personal Data is a wide definition
- Through data magic, not so identifiable data becomes identifiable
How Personal Data is Collected
- Consent is needed before collecting
- Privacy Policies to inform individuals
Using Personal Data only for its collected purpose
Requiring measures to be put in place to secure personal data
Where the personal data is processed and whether it is transferred offshore
Explaining automated decision-making
- Legislation framework
- Practical Implications to Machine Learning Models
Closing Thoughts

Summary of Data Protection and Privacy Laws

What I’ll Cover

Before we dive into the practical aspects of data privacy in data analytics, engineering and science, I'll give an overview of two regimes: the European Union's General Data Protection Regulation (EU GDPR) and the Australian Privacy Principles (APP), which are contained in Australia's Privacy Act 1988.

The California Consumer Privacy Act 2018 (CCPA) also provides comprehensive data privacy and protection regimes for Californians (USA), but this blog's main focus will be on the EU GDPR and APP.

What are the Data Protection Rules/Laws

The EU GDPR generally applies to organisations that offer goods and services to people/businesses in the EU, while the APP generally only applies to Australian organisations (with an annual turnover of over 3 million AUD) and Australian government agencies.

However, the practical effect is much larger, as:

Many major multinationals (e.g. eBay, Amazon, Google) provide goods and services to people in the EU. These multinationals then need to make significant internal and external changes to ensure their entire global value chain is compliant with the EU GDPR.

EU corporations need to consider the EU GDPR (Articles 28 and 46) across the entire 'data chain', meaning third parties they contract with must also adhere to the EU GDPR rules. This means even companies in Australia that deal with EU corporations may need to adhere to the EU GDPR under their contracts.

Note the Office of the Australian Information Commissioner (OAIC), the government agency that enforces the APP, has a good summary of the APP vs GDPR.

The APP and EU GDPR cover a few important aspects that have an impact on data analytics, engineering and science:

What personal data can be collected

What purpose the personal data can be used

Requiring measures to be put in place to secure personal data

Where the personal data is processed and whether it is transferred offshore

Explaining automated decision-making and preventing bias and discrimination in these decisions

I'll now examine each point below.

What is ‘Personal Data’

Under the APP (paragraph B.85), personal information is any information or an opinion about an identified individual, or an individual who is reasonably identifiable.

Under the EU GDPR (Article 4), personal data is any information relating to an identified or identifiable natural person.

Both definitions cover individuals who are directly or indirectly identifiable.

Below are some examples of what is considered 'personal data':

Examples of 'Personal Data' provided by the relevant Regulators

| Australian Privacy Principles | EU General Data Protection Regulation |
| --- | --- |
| Name, signature, address, phone number or date of birth; credit information; employee record information; photographs | Name; identification number; online identifier; video, audio, numerical, graphical and photographic data |
| Geolocation information from a mobile device; IP addresses; telecommunications 'metadata' (subscriber and account details for telecommunications services and devices, information about the sources and destinations of communications, the date, time and duration of communications, and the location of equipment or line used in connection with a communication) | Geolocation data; online identifier; pseudonymous identifiers (Internet Protocol (IP) and MAC addresses, cookie identifiers, web beacons and tracking pixels, RFID tags, and device fingerprints) |
| Sensitive information (includes information or opinion about an individual’s racial or ethnic origin, political opinion, religious beliefs, sexual orientation or criminal record), as well as some aspects of biometric information (e.g. fingerprint, iris, palm) | Special categories (including racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person (e.g. voice prints, facial recognition and fingerprints), data concerning health or data concerning a natural person’s sex life or sexual orientation) |

Personal Data is a wide definition

After deep diving into all these privacy regimes, I discovered that even a simple act of using Google Analytics to analyse web traffic to this website included 'personal data'. The cookies and tracking technologies used are all considered to be collecting 'personal data' and therefore I had to get a privacy policy and cookie policy!

Personally, I learnt quite a lot about technologies like web beacons and tracking pixels. I also learned that you can anonymise IP addresses before they are sent to Google Analytics!
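To make IP anonymisation a bit more concrete, here is a minimal sketch of one common approach: zeroing the trailing bits of the address before it is stored or forwarded. The function name `anonymise_ip` and the prefix lengths (24 bits for IPv4, 48 bits for IPv6) are my own assumptions for illustration, not a description of how any particular analytics product implements it.

```python
import ipaddress


def anonymise_ip(raw_ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the trailing bits of an IP address so it no longer points at one device.

    The prefix lengths are illustrative defaults, not a legal standard.
    """
    address = ipaddress.ip_address(raw_ip)
    prefix = v4_prefix if address.version == 4 else v6_prefix
    # strict=False lets us build the enclosing network from a host address.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(anonymise_ip("203.0.113.77"))                   # 203.0.113.0
    print(anonymise_ip("2001:db8:abcd:12:34:56:78:9a"))   # 2001:db8:abcd::
```

Keep in mind that a truncated IP address combined with other data points (for example a timestamp and a device fingerprint) may still be enough to single out an individual.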

It’s more than just driver’s licences Photo by Dom J from Pexels

Through data magic, not so identifiable data becomes identifiable

The underlying principle, however, is that if a collection of data points can identify an individual, then it is considered Personally Identifiable Information (PII) and therefore 'Personal Data'. This is even more so if you are able to supplement your dataset with publicly available data.

The Information Accountability Foundation has helpfully categorised three types of 'created' Personal Data:

Observed Data - recorded automatically, for example through online cookies
Derived Data - generated from an original dataset, for example by calculating customer's preferences based on the number of items they bought
Inferred Data - produced by using a more complex method of analytics, for example by predicting future health outcomes

The important takeaway is that 'personal data' is quite extensive, and an aggregate of data points that are not directly identifiable on their own can end up being 'personal data'.

Practically, this means the results of an exploratory data science project may be 'personal data'. For example, if you are doing foot traffic analysis in a shopping mall and combine multiple datasets of CCTV footage plus foot traffic to each store, you might be able to identify when a shopper returns to the same store. This would likely constitute 'created personal data'.

In these circumstances, it is recommended to de-identify this 'created' personal data.
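One rough way to check whether a combined dataset has become identifiable is to count how many records share the same combination of 'quasi-identifiers' (a k-anonymity style check). The sketch below is only an illustration under my own assumptions: the column names and sample rows are made up, and a small group size is a warning sign rather than a legal test.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    values for the given quasi-identifier columns (0 if there are no records)."""
    groups = Counter(tuple(r[col] for col in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Hypothetical shopper records: no names, only coarse attributes.
shoppers = [
    {"postcode": "3000", "age_band": "30-39"},
    {"postcode": "3000", "age_band": "30-39"},
    {"postcode": "3000", "age_band": "40-49"},
    {"postcode": "3121", "age_band": "40-49"},
    {"postcode": "3121", "age_band": "40-49"},
]

if __name__ == "__main__":
    # Each column on its own hides every shopper in a group of at least two...
    print(smallest_group_size(shoppers, ["postcode"]))              # 2
    print(smallest_group_size(shoppers, ["age_band"]))              # 2
    # ...but combining the columns singles one shopper out.
    print(smallest_group_size(shoppers, ["postcode", "age_band"]))  # 1
```

A result of 1 means at least one person is unique on those columns, which is exactly the situation where 'not so identifiable' data becomes identifiable.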

How Personal Data is Collected

Consent is needed before collecting

Both regimes generally require consent from the individual before personal data can be collected and used for a specified purpose (although the EU GDPR, in Article 6, also recognises other lawful bases for processing).

Both the APP (B.35) and the EU GDPR require consent to be informed and freely given; it must be a real choice. Informed means the individual must know who is collecting the data, what kind of data will be processed and how it will be used.

That is, if you want to collect personal data for data analytics/science, you will need to be upfront and clear with individuals. Furthermore, in exploratory data science, if you start to use the data for an experiment outside the original scope of the consent, this may be 'function creep' that falls foul of the EU GDPR and APP.

In these discovery phases of data analytics/science projects, it is recommended to avoid technical detail when describing your data analytics activities to individuals, and instead focus on how those activities relate to the main functions of your organisation.

Privacy Policies to inform individuals

Before Personal Data can be collected, both the APP (APP 5) and EU GDPR (Article 13) require entities to have a privacy policy to let individuals know:

What type of personal data is collected and stored

How they collect and store personal data

Why (i.e. the purpose for which) they collect, store, use and disclose the personal data

How and if the personal data is transferred offshore

The EU GDPR also further requires some additional information:

How long personal data is stored

The 'lawful' basis for processing personal data

Be upfront your data will be used for experiments Photo by Chokniti Khongchum from Pexels

The idea is to be upfront and transparent about how you are going to use the data. This means, as a data professional, you need to know what you intend to use the data for before you collect it, and inform individuals if you later deviate from that purpose.

For example, if you do A/B testing on a web application that will require collection of a new field from customers, you will need to ensure individuals are aware the data could be used for such analytics/experiments.

Using Personal Data only for its collected purpose

Both the APP (APP 6) and EU GDPR provide that organisations may generally not use personal information for a purpose other than the primary purpose it was collected for.

Photo by Pixabay from Pexels

Practically, this means in exploratory data science, you won't be able to use certain 'protected' datasets unless you obtain permission from the individuals (or you de-identify/anonymise the dataset). A way to do this is to include a statement in the privacy policy like 'analytics may be conducted using information from a range of sources, such as information collected from third parties'.

The balance between data stewardship and data democratization is further complicated by privacy laws. Within an organisation, you ideally want data to be shared freely, but you also do not want people to unintentionally misuse ‘protected’ datasets.

Many organisations have data catalogues and restrictions on personal data to ensure data users are aware the dataset is protected and can only be used for a strict purpose. In particular, the data catalogue needs to record where the data came from and the purpose of its collection.
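As a minimal sketch of what such a catalogue entry could look like, the snippet below records the source and permitted purposes of a dataset and checks a proposed use against them. The class, field names and purposes are hypothetical and chosen by me for illustration; a real data catalogue tool will have its own schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogueEntry:
    """Hypothetical catalogue record for one dataset."""
    name: str
    source: str                      # where the data came from
    contains_personal_data: bool
    permitted_purposes: frozenset = field(default_factory=frozenset)


def may_use(entry: CatalogueEntry, purpose: str) -> bool:
    """Allow any use of non-personal data; otherwise require a recorded purpose."""
    if not entry.contains_personal_data:
        return True
    return purpose in entry.permitted_purposes


registrations = CatalogueEntry(
    name="web_registrations",
    source="website sign-up form",
    contains_personal_data=True,
    permitted_purposes=frozenset({"account_management", "service_analytics"}),
)

if __name__ == "__main__":
    print(may_use(registrations, "service_analytics"))    # True
    print(may_use(registrations, "third_party_marketing"))  # False
```

The point of the check is not the code itself, but that the purpose of collection travels with the dataset, so an analyst finds out before an experiment (not after) that a use falls outside it.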

Requiring measures to be put in place to secure personal data

De-identification and anonymization of personal data

Both regimes encourage the use of de-identification and anonymization of data. The EU GDPR (Recital 26) makes a distinction between:

Anonymization - irreversibly destroying any way of identifying an individual

Pseudonymisation - substitutes the identity of the data subject, meaning you need additional information to re-identify the data subject. It is reversible.

For the purposes of the APP, personal data that is robustly de-identified will not be ‘personal information’ subject to the APP or Privacy Act 1988. Information is ‘de-identified’ when the risk of re-identification is very low, for example after removing direct identifiers and putting safeguards/controls in place to prevent re-identification.

Furthermore, under the EU GDPR (Recital 26) anonymization means it is no longer personal data subject to the EU GDPR, as it's no longer possible to identify an individual with the data.

The APP (APP 2) also further gives individuals the right to deal with organisations anonymously or by pseudonym, meaning organisations cannot keep personal data about them.

Anonymous Photo by NEOSiAM 2020 from Pexels

Why Pseudonymize Data

The EU GDPR (Recital 78 and Article 25) considers pseudonymisation as a way to show GDPR compliance. Under the EU GDPR (Article 6), if the personal data is pseudonymized, there is more leeway to use it for additional purposes other than the original collection purpose, such as for data analytics, scientific, historical and statistical purposes.

There is less risk of a data breach having adverse effects on individuals if the data is pseudonymized. For example, hackers who download the registrations dataset may just end up with a CRM GUID identifier that is meaningless without the more tightly guarded dataset linking GUIDs to email addresses.
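Here is a minimal sketch of that idea, assuming a keyed hash (HMAC-SHA256) is used to derive the pseudonym. The function names, field names and the key are my own illustrative choices; in practice the key and the lookup table would sit in a separate, tightly controlled store.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


def split_registration(registration: dict, secret_key: bytes):
    """Split one registration into an analytics row (no direct identifier)
    and a lookup row that must be stored separately."""
    pseudonym = pseudonymise(registration["email"], secret_key)
    analytics_row = {"customer_key": pseudonym, "plan": registration["plan"]}
    lookup_row = {"customer_key": pseudonym, "email": registration["email"]}
    return analytics_row, lookup_row


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # assumption: real keys come from a secrets store
    analytics, lookup = split_registration(
        {"email": "bob-smith@awesome.com", "plan": "gold"}, key
    )
    print(analytics)  # contains only the pseudonym and the plan
```

Because the same email always maps to the same pseudonym, analysts can still join and count customers, while re-identification needs either the lookup table or the key.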

What happens to customer data where they are no longer your customer?

This topic is quite interesting and really depends on the purpose for which the personal data was originally collected. In particular, there are a few points worth noting:

You may need to de-identify or destroy the personal data under the APP (APP 11.2) and GDPR's "Right to be Forgotten" (GDPR Article 17)

You may need to transfer their data to a competitor in a 'structured, commonly used, machine-readable format' (ie portable format) (GDPR Article 20)

The practical implications are:

Data lifecycle policies - archiving of personal data may need to involve de-identification.

Data will likely need to be stored in open-source and commonly used formats, such as CSV, JSON, XML or Parquet. The implications may be that if you have data in proprietary systems, you will need to put in automated processes that allow data to be extracted.
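As a small illustration of what an automated extract into a portable format might look like, the sketch below writes one customer record out as both JSON and CSV using only the standard library. The record fields and the function name `export_customer` are assumptions for the example.

```python
import csv
import io
import json


def export_customer(record: dict) -> dict:
    """Return the same customer record serialised as JSON text and CSV text."""
    as_json = json.dumps(record, indent=2, sort_keys=True)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(record))
    writer.writeheader()
    writer.writerow(record)

    return {"json": as_json, "csv": buffer.getvalue()}


if __name__ == "__main__":
    # Hypothetical record pulled from a proprietary system.
    customer = {"name": "Bob Smith", "email": "bob-smith@awesome.com", "plan": "gold"}
    exported = export_customer(customer)
    print(exported["json"])
    print(exported["csv"])
```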

Single View of Customer and Data Catalogues

A Single View of Customer is essentially all the information about a customer in one spot (ideally a single user interface).

Often, different types of data will be in different systems (e.g. billing may be in SAP, customer interaction and complaints in Salesforce CRM, revenue predictions in a Tableau dashboard).

To add even more complication, the data lake architecture may mean unstructured data about customers is collected (e.g. chat history with a chatbot, call centre recordings).

Without a proper data catalogue and lineage, it would be very difficult to navigate the data lake to find all the personal data of an individual, reducing the lake to the notorious 'data swamp'.

What is interesting is that building a Single View of Customer is generally seen as a 'data' or 'business' issue, as:

Sales, Marketing or Customer Operations need to see all the customer data in one spot to make informed business decisions. For example, you wouldn't want the salesperson for gas retailing not to know the person is already an electricity customer.

Data science work requires a 'holistic' view of a customer to get the best results. For example, churn or retention analysis, customer segmentation analysis and predictive modelling. The more connection points, the better your model will be.

However, in a way, I would argue that data privacy laws almost mandate having a single view of customer for two reasons:

GDPR (Article 5) and APP (APP 10 and 13) both require that personal data kept must be accurate and up to date

GDPR (Article 15) and APP (APP 12) both allow an individual to make a request to access all their personal data. GDPR and APP both require this information to be made available within 1 month and 30 days respectively.

As a side note, the Australian Competition and Consumer (Consumer Data Right) Rules 2020 (AU CDR) will apply to the banking sector in late 2020 and the energy sector in the future. It will greatly expand on an individual's right under the APP (APP 12) to access data.

It would be very difficult to comply with these requirements if you cannot even find all the personal data about one customer. Therefore, the next time you see a 'Single View of Customer' issue, remember that it could potentially have data privacy implications too.

Notifying individuals of data breaches

Under Australia's Notifiable Data Breach Scheme (NDB Scheme), a data breach needs to be reported to the Australian Regulator 'as soon as practicable' when it is likely to result in serious harm to an individual whose personal data is involved, for example if the personal data is mistakenly given to the wrong person. The only fixed deadline is that the assessment of whether the breach is serious needs to be done within 30 days.

The EU GDPR (Article 33) has a stricter standard and requires the data breach to be notified to the EU Regulator within 72 hours of becoming aware of it, unless the data breach is unlikely to result in a risk to the rights and freedoms of individuals.

Practically, this means an organisation subject to the EU GDPR only has 72 hours after becoming aware of a data breach to conduct an assessment and forensic investigation.
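To show how tight these windows are, here is a tiny sketch that turns the moment you become aware of a breach into the two deadlines discussed above. The constant and key names are my own; the 72-hour and 30-day figures are the ones described in this section, and this is an illustration rather than a compliance tool.

```python
from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)  # notify the EU Regulator
NDB_ASSESSMENT_WINDOW = timedelta(days=30)      # finish the Australian assessment


def breach_deadlines(became_aware_at: datetime) -> dict:
    """Return the latest times for GDPR notification and NDB assessment."""
    return {
        "gdpr_notify_regulator_by": became_aware_at + GDPR_NOTIFICATION_WINDOW,
        "ndb_finish_assessment_by": became_aware_at + NDB_ASSESSMENT_WINDOW,
    }


if __name__ == "__main__":
    aware = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
    for name, deadline in breach_deadlines(aware).items():
        print(name, deadline.isoformat())
```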

Data Breach! Photo by Negative Space from Pexels

The key takeaway is controls and audit logging should be in place to ensure data isn't accidentally accessed or given to the wrong person.

An example is the Role-Based Access Control (RBAC) and Principle of Least Privilege approach to authentication (AuthN) and authorisation (AuthZ). In such approaches, blanket 'superuser' or 'admin' access is rarely granted (if ever) and only the least amount of access is granted to a role to perform its function.

Users are then assigned a role, and access is not assigned directly to the user. This prevents unauthorised access lingering when a user's role changes.
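A minimal sketch of this pattern is below. The role names, permission strings and the in-memory audit log are all hypothetical; a real system would rely on the access control features of your database or cloud platform rather than a Python dictionary.

```python
# Permissions are attached to roles, never directly to users.
ROLE_PERMISSIONS = {
    "analyst": {"read:sales_facts"},
    "crm_support": {"read:customer_dim", "update:customer_dim"},
}

# Each user holds exactly one role in this simplified example.
USER_ROLES = {"alice": "analyst", "bob": "crm_support"}

# Every access decision is recorded so a breach assessment has something to work with.
AUDIT_LOG = []


def is_allowed(user: str, permission: str) -> bool:
    """Check a permission via the user's role and record the decision."""
    role = USER_ROLES.get(user)
    allowed = permission in ROLE_PERMISSIONS.get(role, set())
    AUDIT_LOG.append((user, role, permission, allowed))
    return allowed


if __name__ == "__main__":
    print(is_allowed("alice", "read:sales_facts"))   # True
    print(is_allowed("alice", "read:customer_dim"))  # False
    # Changing a role changes access in one place, with nothing left behind.
    USER_ROLES["alice"] = "crm_support"
    print(is_allowed("alice", "read:sales_facts"))   # False
```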

Practical Tips for Data Work

In light of all the measures required to collect, use, secure and store personal data, here are a few practical tips for data science and engineering work that may be useful.

In relation to privacy-compliant machine learning models:

Anonymize all personal data as part of preprocessing - if the data isn't already anonymised, remove identifiers and other personal data (e.g. customer id, name, age) during the preprocessing stage. This in itself doesn't guarantee the data will not be personal data, but it reduces the risk.

Federated (collaborative) Learning - the idea is that rather than having the ML model run centrally and collect all the personal data, it runs on each user's device. That way, the personal data never leaves their own device. Examples include the Python-based Deep Learning library, PySyft.
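For the preprocessing tip, a minimal sketch could look like the following. The column names are hypothetical, and replacing the exact age with a ten-year band (rather than dropping it) is my own assumption to show that generalising a field is sometimes an alternative to removing it.

```python
# Hypothetical columns treated as direct identifiers in this example.
DIRECT_IDENTIFIERS = {"customer_id", "name", "email"}


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def strip_identifiers(row: dict) -> dict:
    """Drop direct identifiers and coarsen age before the row reaches a model."""
    cleaned = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age_band"] = age_band(cleaned.pop("age"))
    return cleaned


if __name__ == "__main__":
    raw = {"customer_id": 17, "name": "Bob Smith", "age": 34, "tenure_months": 12}
    print(strip_identifiers(raw))  # {'tenure_months': 12, 'age_band': '30-39'}
```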

In relation to data warehouse/lake modelling (e.g. Kimball dimension modelling):

Anonymize data in the ETL process before it arrives in the dimension and fact tables. Useful tools include data masking in SQL Server (e.g. making bob-smith@awesome.com into bXXXXXXX@aXXXXX.com)

Differential Privacy - the idea is to de-identify, add noise and make small tweaks to the personal data so it retains all the key characteristics, while becoming de-identified.
A great example of this in action is Apple - they run ML models on iPhone keyboard inputs for their predictive texts, emojis etc. However, they add noise to the individual user inputs before they leave the device, so it's impossible to figure out who sent what emoji.

See my [blog post](/data_protection/2022/02/07/data-privacy-in-practice.html) for more detailed techniques on using differential privacy.

Use only surrogate keys in fact tables that cannot identify an individual without a dimension table.

Restrict access to these dimension tables.
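To give a feel for the 'add noise before it leaves the device' idea, here is a minimal sketch of randomised response, a classic technique in the local differential privacy family. It is not a description of Apple's actual implementation; the function names, the probability of answering truthfully and the simulated population are all assumptions for illustration.

```python
import random


def randomised_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth: float) -> float:
    """Undo the noise in aggregate: observed = p * true + (1 - p) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated population in which 30% of users have the sensitive attribute.
    truths = [i < 6000 for i in range(20000)]
    reports = [randomised_response(t, 0.5, rng) for t in truths]
    print(round(estimate_true_rate(reports, 0.5), 2))
```

Any single report is deniable (it may just be the coin flip), yet the aggregate estimate stays close to the real rate, which is the trade-off differential privacy is after.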

Where the personal data is processed and whether it is transferred offshore

Both the EU GDPR (Article 46) and APP (APP 8) have restrictions on the transfer of personal data outside of their jurisdictions (the EU and Australia respectively). The only exception under the EU GDPR (Article 45) is if the EU determines the country to have an 'adequate level of data protection'.

As a note, in a 2001 decision, the EU did not consider Australia to have an 'adequate level of protection' (i.e. laws that offer essentially equivalent protection to the EU's). This is mainly due to Australia having exemptions to the APP for some types of data and small businesses.

Regardless, under both regimes the organisation transferring the personal data offshore must put safeguards in place (e.g. contractual obligations, such as data protection clauses) to ensure the overseas recipient does not breach the privacy laws of the EU GDPR or APP.

Practically, this means having a binding contractual term that requires the recipient to adhere to the EU GDPR or APP.

Under the APP, for example, an offshore disclosure includes revealing personal information at an international conference or publishing personal data that is accessible by an overseas recipient.

If safeguards are not in place, the organisation transferring the information offshore will also be liable for any breaches of the privacy law. For the EU GDPR, the transfer in itself will be considered a breach unless safeguards are taken.

The practical implications of this are:

When using a cloud service provider (e.g. AWS, Azure) - where are you keeping personal data?

If you are using cloud Software-as-a-service (SaaS) providers, such as DataRobot for automated machine learning, where is the data being processed?

Are these organisations compliant with the EU GDPR or APP?
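One lightweight way to keep on top of the 'where is it?' question is to record the hosting region of each dataset and flag personal data sitting outside an approved list. The sketch below is only an illustration: the dataset inventory and the approved regions are hypothetical (the region codes are written in the style of cloud provider region names), and whether a region is acceptable is a legal question, not a coding one.

```python
# Hypothetical policy: regions where personal data is approved to be stored.
APPROVED_REGIONS = {"ap-southeast-2", "eu-west-1"}

# Hypothetical inventory, e.g. exported from a data catalogue.
DATASETS = [
    {"name": "customer_dim", "personal_data": True, "region": "ap-southeast-2"},
    {"name": "chat_transcripts", "personal_data": True, "region": "us-east-1"},
    {"name": "public_weather", "personal_data": False, "region": "us-east-1"},
]


def residency_violations(datasets, approved_regions):
    """Return the names of personal datasets stored outside approved regions."""
    return [
        d["name"]
        for d in datasets
        if d["personal_data"] and d["region"] not in approved_regions
    ]


if __name__ == "__main__":
    print(residency_violations(DATASETS, APPROVED_REGIONS))  # ['chat_transcripts']
```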

Explaining automated decision-making

The increase of automation and machine learning means many decisions can now be made without human intervention. An example is an automated system that determines whether to approve a credit card application.

The risk of ‘black box’ systems has led to legislation that aims to increase transparency and to prevent these systems from discriminating on the basis of personal factors (e.g. race, gender, age).

Legislation framework

The EU GDPR (Article 22) expressly gives protection for individuals against automated processing/profiling, unless they explicitly consent to it.

Furthermore, the GDPR (Recital 71) also provides that such automated processing/profiling must not have discriminatory effects based on personal aspects of the individual.

An example is an automated recruitment process which rejects an applicant on the basis of an analysis or prediction of their performance at work, economic situation, health, personal preferences or interests, reliability or behaviour, location or movements.

The APP (APP 10 and 12) indirectly address discrimination by requiring entities to verify the accuracy of personal data. The flow-on effect is that entities are also expected to:

ensure analytics/algorithms and automated processes are operating appropriately and not creating biased, inaccurate, discriminatory, or unjustified results

be transparent about how analytic techniques and algorithms arrived at a decision

Furthermore, in Australia, anti-discrimination legislation (e.g. Age Discrimination Act 2004, Racial Discrimination Act 1975) exists to safeguard against scenarios where automated processing results in a biased/discriminatory outcome.

Automation - robots? Photo by Alex Knight from Pexels

Practical Implications to Machine Learning Models

The practical implications as a data scientist are:

Explainability of models is important - if an individual asks how the decision is made, the model should be explainable

As part of model deployment/productionisation, you should consider including an automated explaining component. For example, if you already deploy your ML as a service (via API endpoint), you could have an additional API endpoint where you pass in a prediction ID and the explanation is returned.

Features involving personal data should be carefully used - if you suspect adding them may generate a discriminatory effect, better to leave them out

'Kitchen sink' approaches to model training should be avoided - don't throw all the data you have in and train it, especially if you don't know where the data came from. In practice, strict data controls and data catalogues should prevent this approach.

Address Over and Under Sampling - use techniques such as SMOTE (available in open-source libraries like imbalanced-learn) to address imbalanced datasets (e.g. 90% of loans rejected were people of a particular demographic).

When automating existing processes, such as a claims process, consider checking whether there is inherent bias in the process. Otherwise the baseline and sample dataset will have bias, which will flow onto the final automated process. Conduct bias tests, such as demographic/statistical parity, as a way to check for inherent bias in the data (even if you have removed sensitive fields - e.g. race, religion). At times, sensitive fields are highly correlated to non-sensitive fields (e.g. majority of an ethnic group live in a particular suburb), which results in 'unaware' bias.

Use open-source libraries such as SHAP, IBM AI Fairness 360 and other Explainable AI (XAI) and Bias detection techniques/libraries - they can assist with explaining both the model in general, as well as every prediction. It also makes it easier to explain the model to internal stakeholders.

Finding the underlying causation - your target variable may be highly correlated with a personal factor, but the underlying causation may be with something else entirely (e.g. chicken consumption and number of cars may be highly correlated, but only because they are both correlated to the overall strength of the economy)
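As a minimal sketch of the demographic/statistical parity test mentioned above, the snippet below compares approval rates across groups and reports the largest gap. The group labels and decisions are made up, and what size of gap is acceptable is a judgement call that this code does not answer.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Compute the approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / totals[group] for group in totals}


def demographic_parity_difference(decisions) -> float:
    """Gap between the highest and lowest group approval rates (0 means parity)."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    # Hypothetical model outputs grouped by suburb, a possible proxy for a sensitive field.
    decisions = (
        [("suburb_a", True)] * 8 + [("suburb_a", False)] * 2
        + [("suburb_b", True)] * 4 + [("suburb_b", False)] * 6
    )
    print(approval_rates(decisions))
    print(round(demographic_parity_difference(decisions), 2))  # 0.4
```

Running the check on a non-sensitive field such as suburb is deliberate: it is one way to spot the 'unaware' bias described above, where a proxy carries the sensitive information.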

Closing Thoughts

It has been an interesting exercise looking at data work through the lens of privacy laws. It is sometimes too easy to go down the proverbial 'rabbit hole' in data experiments and not realise the legal implications of such experiments.

Hopefully this blog gives you a little bit more insight about how privacy laws relate to data work.

Sailing Through Data Privacy Waters - A Recap on How Data Protection Laws Work

A practical guide to privacy law in the context of data work
