Privacy, privacy, privacy
In recent years, privacy rights and data protection have become a hot topic, and legislation and regulations have caught up.
This blog post will explore how these laws affect day-to-day data work and what it means, as a data professional, to avoid getting on the wrong side of the law. Unfortunately, I need to give the usual legal disclaimer before I continue:
I do not hold myself out to be a lawyer and the below information does not constitute legal advice. It is simply my take on how the law applies to data-specific jobs.
- Privacy, privacy, privacy
- Summary of Data Protection and Privacy Laws
- What is ‘Personal Data’
- How Personal Data is Collected
- Using Personal Data only for its collected purpose
- Requiring measures to be put in place to secure personal data
- Where the personal data is processed and whether it is transferred offshore
- Explaining automated decision-making
- Closing Thoughts
Summary of Data Protection and Privacy Laws
What I’ll Cover
Before we dive into the practical aspects of data privacy in data analytics, engineering and science, I'll give an overview of two regimes: the European Union's General Data Protection Regulation (EU GDPR) and the Australian Privacy Principles (APP) under Australia's Privacy Act 1988.
The California Consumer Privacy Act 2018 (CCPA) also provides a comprehensive data privacy and protection regime for Californians (USA), but this blog's main focus will be on the EU GDPR and APP.
What are the Data Protection Rules/Laws
The EU GDPR generally applies to organisations that offer goods and services to people/businesses in the EU, while the APP generally only applies to Australian organisations (with an annual turnover of over 3 million AUD) and Australian government agencies.
However, the practical effect is much larger, as:
- Many major multinationals (e.g. eBay, Amazon, Google) provide goods and services to people in the EU. These multinationals then need to make significant internal and external changes to ensure their entire global value chain is compliant with the EU GDPR.
- EU corporations need to consider the EU GDPR (Articles 28 and 46) across the entire 'data chain', meaning third parties they contract with must also adhere to the EU GDPR rules. This means even companies in Australia that deal with EU corporations may need to adhere to the EU GDPR under their contracts.
Note the Office of the Australian Information Commissioner (OAIC), the government agency that enforces the APP, has a good summary of the APP vs GDPR.
The APP and EU GDPR cover a few important aspects that have an impact on data analytics, engineering and science:
- What personal data can be collected
- What purpose the personal data can be used for
- Requiring measures to be put in place to secure personal data
- Where the personal data is processed and whether it is transferred offshore
- Explaining automated decision-making and preventing bias and discrimination in these decisions
I'll now examine each point below.
What is ‘Personal Data’
Under the APP (paragraph B.85), personal information is any information or an opinion about an identified individual, or an individual who is reasonably identifiable.
Under the EU GDPR (Article 4), personal data is any information relating to an identified or identifiable natural person.
Both definitions cover individuals who are directly or indirectly identifiable.
Below are some examples of what is considered 'personal data':
|Australian Privacy Principles||EU General Data Protection Regulations|
Personal Data is a wide definition
Personally, I learnt quite a lot about tracking technologies like web beacons and tracking pixels. I also learned that you can anonymise IP addresses when they are sent to Google Analytics!
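To give a feel for what IP anonymisation involves, here is a minimal sketch that truncates an address before it is stored or sent anywhere. The function name and the prefix lengths (24 bits for IPv4, 48 bits for IPv6) are my own assumptions for illustration. This is not Google Analytics' actual implementation.

```python
import ipaddress


def anonymise_ip(raw_ip: str) -> str:
    """Zero out the host part of an IP address (hypothetical helper)."""
    ip = ipaddress.ip_address(raw_ip)
    # Assumed prefix lengths: keep the first 24 bits of IPv4, 48 bits of IPv6.
    prefix = 24 if ip.version == 4 else 48
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{raw_ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymise_ip("203.0.113.57"))
print(anonymise_ip("2001:db8:abcd:12:34:56:78:9a"))
```

A truncated address still narrows down a location, so on its own this reduces the risk of identification rather than removing it.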
Through data magic, not so identifiable data becomes identifiable
The underlying principle, however, is that if a collection of data points can identify an individual, then it is considered Personally Identifiable Information (PII) and therefore 'Personal Data'. This is further true if you are able to supplement your dataset with publicly available data.
The Information Accountability Foundation has helpfully categorised three types of 'created' Personal Data:
- Observed Data - recorded automatically, for example through online cookies
- Derived Data - generated from an original dataset, for example by calculating customer's preferences based on the number of items they bought
- Inferred Data - produced by using a more complex method of analytics, for example by predicting future health outcomes
The important takeaway is that 'personal data' is quite extensive, and an aggregate of data that is not directly identifiable can end up being 'personal data'.
Practically, this means the results of an exploratory data science project may be 'personal data'. For example, if you are doing foot traffic analysis in a shopping mall and combine multiple datasets of CCTV footage plus foot traffic to each store, you might be able to identify when a shopper returns to the same store. This would likely constitute 'created personal data'.
In these circumstances, it is recommended to de-identify this 'created' personal data.
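One way to get a feel for whether a combined dataset has become identifiable is to count how many records share each combination of indirect identifiers. The sketch below does this on made-up records. It borrows the idea behind k-anonymity, which is my addition and not something either regime prescribes, and the column names are hypothetical.

```python
from collections import Counter

# Hypothetical records: no names, but several indirect identifiers.
records = [
    {"postcode": "3000", "birth_year": 1985, "gender": "F"},
    {"postcode": "3000", "birth_year": 1985, "gender": "F"},
    {"postcode": "3000", "birth_year": 1990, "gender": "M"},
    {"postcode": "3056", "birth_year": 1972, "gender": "F"},
]


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)


def smallest_group_size(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same values."""
    return min(group_sizes(rows, quasi_identifiers).values())


def unique_rows(rows, quasi_identifiers):
    """Return rows whose combination of values appears exactly once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[col] for col in quasi_identifiers)] == 1
    ]


cols = ["postcode", "birth_year", "gender"]
print(smallest_group_size(records, cols))
print(unique_rows(records, cols))
```

Any row that is unique on these columns is a candidate for re-identification, especially once it is joined with publicly available data.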
How Personal Data is Collected
Consent is needed before collecting
Both regimes require consent from the individual before the personal data can be collected and used for a specified purpose.
Both the APP (B.35) and the EU GDPR require consent to be informed and 'freely' given, and it must be a real choice. Informed means the individual must know who is collecting the data, what kind of data will be processed and how it will be used.
That is, if you want to collect personal data for data analytics/science, you will need to be upfront and clear with individuals. Furthermore, in exploratory data science, if you start to use the data for an experiment outside the original scope of the consent, this may be 'function creep' that falls foul of the EU GDPR and APP.
In these discovery phases of data analytics/science projects, it is recommended to avoid technical details of data analytics activities and focus on how your data analytics activities relate to the main functions of your organisation.
Privacy Policies to inform individuals
Both regimes expect organisations to have a privacy policy that tells individuals:
- What type of personal data is collected and stored
- How they collect and store personal data
- Why (i.e. the purpose for which) they collect, store, use and disclose the personal data
- How and if the personal data is transferred offshore
The EU GDPR also requires some additional information:
- How long personal data is stored
- The 'lawful' basis for processing personal data
The idea is to be upfront and transparent about how you are going to use the data. This means, as a data professional, before you collect data, you need to know what you intend to use it for and, if you deviate from that, inform individuals.
For example, if you do A/B testing on a web application that will require collection of a new field from customers, you will need to ensure individuals are aware the data could be used for such analytics/experiments.
Using Personal Data only for its collected purpose
Both the APP (APP 6) and EU GDPR provide that organisations may generally not use personal information for a purpose other than the primary purpose it was collected for.
The balance between data stewardship and data democratization is further complicated by privacy laws. Within an organisation, you ideally want data to be shared freely, but you also do not want people to unintentionally misuse ‘protected’ datasets.
Many organisations have data catalogues and restrictions on personal data to ensure data users are aware the dataset is protected and can only be used for a strict purpose. In particular, the data catalogue needs to record where the data came from and the purpose of its collection.
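As a rough illustration of how a catalogue entry could carry the source and collection purpose, and how a proposed use could be checked against it, here is a minimal sketch. The class, field names and purposes are all hypothetical; real data catalogue tools have their own schemas.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record for one dataset."""
    name: str
    source: str
    contains_personal_data: bool
    permitted_purposes: set = field(default_factory=set)


def is_use_permitted(entry: CatalogueEntry, purpose: str) -> bool:
    """Allow any use of non-personal data; otherwise require a listed purpose."""
    if not entry.contains_personal_data:
        return True
    return purpose in entry.permitted_purposes


registrations = CatalogueEntry(
    name="customer_registrations",
    source="website sign-up form",
    contains_personal_data=True,
    permitted_purposes={"account management", "billing"},
)

print(is_use_permitted(registrations, "billing"))
print(is_use_permitted(registrations, "churn modelling"))
```

A check like this does not replace a legal assessment, but it makes the original purpose visible at the moment someone reaches for the dataset.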
Requiring measures to be put in place to secure personal data
De-identification and anonymization of personal data
Both regimes encourage the use of de-identification and anonymization of data. The EU GDPR (Recital 26) makes a distinction between:
- Anonymization - irreversibly destroying any way of identifying an individual
- Pseudonymisation - substitutes the identity of the data subject, meaning you need additional information to re-identify the data subject. It is reversible.
For the purposes of the APP, personal data that is robustly de-identified will not be ‘personal information’ subject to the APP or Privacy Act 1988. The information is ‘de-identified’ when the risk of re-identification is very low, for example because direct identifiers have been removed and safeguards/controls are in place to prevent re-identification.
Furthermore, under the EU GDPR (Recital 26) anonymization means it is no longer personal data subject to the EU GDPR, as it's no longer possible to identify an individual with the data.
The APP (APP 2) also gives individuals the right to deal with organisations anonymously or by pseudonym, meaning organisations cannot keep personal data about them.
Why Pseudonymize Data
The EU GDPR (Recital 78 and Article 25) considers pseudonymisation as a way to show GDPR compliance. Under the EU GDPR (Article 6), if the personal data is pseudonymized, there is more leeway to use it for additional purposes other than the original collection purpose, such as for data analytics, scientific, historical and statistical purposes.
There is less risk of a data breach having adverse effects on individuals if the data is pseudonymized. Hackers who download the registrations dataset, for example, may just end up with a CRM GUID identifier that is meaningless without the more tightly guarded dataset linking GUIDs to email addresses.
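The GUID approach described above can be sketched as follows: the working dataset keeps only a random GUID, while the lookup from email to GUID is returned separately so it can be stored under tighter access. The function and field names are my own, for illustration only.

```python
import uuid


def pseudonymise(rows, identifier="email"):
    """Split rows into a pseudonymised dataset and a separate lookup table."""
    lookup = {}  # identifier -> GUID; keep this under tighter access control
    pseudonymised = []
    for row in rows:
        value = row[identifier]
        if value not in lookup:
            lookup[value] = str(uuid.uuid4())
        # Copy every field except the direct identifier.
        cleaned = {k: v for k, v in row.items() if k != identifier}
        cleaned["customer_guid"] = lookup[value]
        pseudonymised.append(cleaned)
    return pseudonymised, lookup


registrations = [
    {"email": "jane@example.com", "plan": "gas"},
    {"email": "jane@example.com", "plan": "electricity"},
    {"email": "sam@example.com", "plan": "gas"},
]
safe_rows, guid_lookup = pseudonymise(registrations)
print(safe_rows)
```

Because the lookup table allows re-identification, this is pseudonymisation and not anonymisation: the data is still personal data for whoever holds both parts.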
What happens to customer data when they are no longer your customer?
This topic is quite interesting and really depends on the purpose for which the personal data was originally collected. In particular, there are a few points worth noting:
- You may need to de-identify or destroy the personal data under the APP (APP 11.2) and GDPR's "Right to be Forgotten" (GDPR Article 17)
- You may need to transfer their data to a competitor in a 'structured, commonly used, machine-readable format' (ie portable format) (GDPR Article 20)
The practical implications are:
- Data lifecycle policies - archiving of personal data may need to involve de-identification.
- Data will likely need to be stored in open-source and commonly used formats, such as CSV, JSON, XML or Parquet. The implications may be that if you have data in proprietary systems, you will need to put in automated processes that allow data to be extracted.
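As a small sketch of what a portable export could look like, the example below writes a customer record to JSON and a flat list of rows to CSV using only the standard library. The record layout and function names are hypothetical; a real export would pull from whatever systems hold the data.

```python
import csv
import io
import json

# Hypothetical customer record assembled from internal systems.
customer = {
    "customer_id": "C-1001",
    "name": "Jane Citizen",
    "plans": [{"product": "gas", "start": "2019-03-01"}],
}


def export_json(record: dict) -> str:
    """Serialise a (possibly nested) record to JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)


def export_csv(rows: list) -> str:
    """Write a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(export_json(customer))
print(export_csv(customer["plans"]))
```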
Single View of Customer and Data Catalogues
A Single View of Customer is essentially all the information about a customer in one spot (ideally a single user interface).
Often, different types of data will be in different systems (e.g. billing may be in SAP, customer interactions and complaints in Salesforce CRM, revenue predictions in a Tableau dashboard).
To add even more complication, the data lake architecture may mean unstructured data about customers is collected (e.g. chat history with a chatbot, call centre recordings).
Without a proper data catalogue and lineage, it would be very difficult to navigate the data lake to find all the personal data of an individual, reducing the lake to the notorious 'data swamp'.
What is interesting is that building a Single View of Customer is generally seen as a 'data' or 'business' issue, as:
- Sales, Marketing or Customer Operations need to see all the customer data in one spot to make informed business decisions. For example, you wouldn't want the salesperson for gas retailing not to know the person is already an electricity customer.
- Data science work requires a 'holistic' view of a customer to get the best results, for example in churn or retention analysis, customer segmentation analysis and predictive modelling. The more connection points, the better your model will be.
However, I would argue that data privacy laws almost mandate having a single view of customer, for two reasons:
- GDPR (Article 5) and APP (APP 10 and 13) both require that personal data kept must be accurate and up to date
- GDPR (Article 15) and APP (APP 12) both allow an individual to make a request to access all their personal data. GDPR and APP both require this information to be made available within 1 month and 30 days respectively.
As a side note, the Australian Competition and Consumer (Consumer Data Right) Rules 2020 (AU CDR) will apply to the banking sector in late 2020 and the energy sector in the future. It will greatly expand on an individual's right under the APP (APP 12) to access data.
It would be very difficult to comply with these requirements if you cannot even find all the personal data about one customer. Therefore, the next time you see a 'Single View of Customer' issue, remember that it could have data privacy implications too.
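To make the access-request point concrete, here is a minimal sketch of gathering everything held about one customer across several systems. It assumes each system can be queried by a shared customer ID, which is exactly what a Single View of Customer gives you. The system names and data are made up.

```python
# Hypothetical extracts from separate systems, keyed by a shared customer ID.
billing = {"C-1001": {"balance": 120.50}}
crm = {"C-1001": {"complaints": 1}, "C-2002": {"complaints": 0}}
chat_logs = {"C-1001": ["Hi, I have a question about my bill."]}

SYSTEMS = {"billing": billing, "crm": crm, "chat_logs": chat_logs}


def collect_personal_data(customer_id: str) -> dict:
    """Gather everything held about one customer, system by system."""
    return {
        system: data[customer_id]
        for system, data in SYSTEMS.items()
        if customer_id in data
    }


print(collect_personal_data("C-1001"))
print(collect_personal_data("C-2002"))
```

Without a shared key and a catalogue of which systems hold personal data, the same request becomes a manual search through every system.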
Notifying individuals of data breaches
Under Australia's Notifiable Data Breach Scheme (NDB Scheme), a data breach needs to be reported to the Australian regulator 'as soon as practicable' when it is likely to result in serious harm to an individual whose personal data is involved, for example if the personal data is mistakenly given to the wrong person. The only fixed deadline is that an assessment of whether the breach is serious needs to be completed within 30 days.
The EU GDPR (Article 33) has a stricter standard and requires notifying the EU regulator of the data breach within 72 hours of becoming aware of it, unless the data breach is unlikely to result in a risk to the rights and freedoms of individuals.
Practically, this means an organisation subject to the EU GDPR only has 72 hours after discovering a data breach to conduct an assessment and forensic investigation.
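The 72-hour window is simple arithmetic, but it is worth building into incident tooling so nobody has to work it out under pressure. The sketch below measures the window from the time the organisation became aware of the breach, which is how Article 33 counts it. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time to notify the regulator, counted from awareness of the breach."""
    return became_aware_at + GDPR_NOTIFICATION_WINDOW


def hours_remaining(became_aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline (negative once it has passed)."""
    remaining = notification_deadline(became_aware_at) - now
    return remaining.total_seconds() / 3600


aware = datetime(2020, 6, 1, 9, 0, tzinfo=timezone.utc)
print(notification_deadline(aware))
print(hours_remaining(aware, datetime(2020, 6, 2, 9, 0, tzinfo=timezone.utc)))
```

Using timezone-aware timestamps avoids ambiguity when the incident team and the affected systems sit in different time zones.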
The key takeaway is controls and audit logging should be in place to ensure data isn't accidentally accessed or given to the wrong person.
An example is the Role-Based Access Control (RBAC) and Principle of Least Privilege approach to authentication (AuthN) and authorisation (AuthZ). In such approaches, blanket 'superuser' or 'admin' access is rarely granted (if ever) and only the least amount of access is granted to a role to perform its function.
Users are then assigned a role and accesses are not directly assigned to the user. This prevents unauthorised access when a user's role changes.
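A minimal sketch of this idea is below: permissions hang off roles, users only hold roles, and every access check is written to an audit log. The role names, permission strings and in-memory dictionaries are hypothetical stand-ins for a real identity system.

```python
# Permissions are attached to roles, never directly to users.
ROLE_PERMISSIONS = {
    "analyst": {"read:sales_aggregates"},
    "billing_officer": {"read:billing", "write:billing"},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"billing_officer"}}


def is_allowed(user: str, permission: str) -> bool:
    """A user is allowed only if one of their current roles grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def access(user: str, permission: str, audit_log: list) -> bool:
    """Check access and record the attempt, whether or not it succeeds."""
    allowed = is_allowed(user, permission)
    audit_log.append((user, permission, allowed))
    return allowed


log = []
print(access("alice", "read:sales_aggregates", log))
print(access("alice", "read:billing", log))
print(log)
```

When someone changes jobs, swapping their role is enough to remove the old access, because nothing was ever granted to them directly.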
Practical Tips for Data Work
In light of all the measures required to collect, use, secure and store personal data, here are a few practical tips for data science and engineering work that may be useful.
In relation to privacy-compliant machine learning models:
- Anonymize all personal data as part of preprocessing - if the data isn't already anonymised, remove identifiers and other personal data (e.g. customer id, name, age) during the preprocessing stage. This in itself doesn't guarantee the data will not be personal data, but it reduces the risk.
- Federated (collaborative) Learning - the idea is rather than having the ML model run centrally and collect all the personal data, it runs on each user's devices. That way, the personal data never leaves their own device. Examples include the Python-based Deep Learning library, PySyft.
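For the preprocessing tip, a minimal sketch could look like the following. The column names are assumptions, and grouping age into bands (instead of dropping it outright) is my own variation to keep some signal for the model.

```python
DIRECT_IDENTIFIERS = {"customer_id", "name", "email"}  # assumed column names


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def strip_identifiers(row: dict) -> dict:
    """Drop direct identifiers and replace exact age with an age band."""
    cleaned = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age_band"] = age_band(cleaned.pop("age"))
    return cleaned


raw = {"customer_id": 1, "name": "Jane", "age": 34, "monthly_spend": 20.0}
print(strip_identifiers(raw))
```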
In relation to data warehouse/lake modelling (e.g. Kimball dimension modelling):
- Anonymize data in the ETL process before it arrives in the dimension and fact tables. Useful tools include data masking in SQL Server (e.g. making email@example.com into bXXXXXXX@aXXXXX.com)
- Differential Privacy - the idea is to de-identify, add noise and make small tweaks to the personal data so it retains all the key characteristics while becoming de-identified.
A great example of this in action is Apple - they run ML models on iPhone keyboard inputs for their predictive text, emojis etc. However, they add noise to the individual user inputs before they leave the device, so it's impossible to figure out who sent what emoji.
See my [blog post](/data_analytics/2022/02/07/data-privacy-in-practice.html) for more detailed techniques on using differential privacy.
- Use only surrogate keys in fact tables that cannot identify an individual without a dimension table.
- Restrict access to these dimension tables.
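To illustrate the 'add noise' idea behind differential privacy, here is a minimal sketch of a noisy count using the Laplace mechanism. The Laplace mechanism is one common technique and is my choice for illustration. It is not a description of how Apple does it, and the function names and epsilon value are assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon: float, rng=None) -> float:
    """Count matching values and add noise calibrated to the privacy budget."""
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


ages = [25, 37, 41, 68, 70]
print(noisy_count(ages, lambda age: age >= 40, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy, at the cost of a less accurate count.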
Where the personal data is processed and whether it is transferred offshore
Both the EU GDPR (Article 46) and APP (APP 8) restrict the transfer of personal data outside their jurisdictions (the EU and Australia respectively). The main exception under the EU GDPR (Article 45) is if the EU determines the country to have an 'adequate level of data protection'.
As a note, in a 2001 decision, the EU did not consider Australia to have an 'adequate level of protection' (i.e. its laws do not offer protection essentially equivalent to EU law). This is mainly due to Australia having exemptions to the APP for some types of data and small businesses.
Regardless, under both regimes the organisation transferring the personal data offshore must put safeguards in place (e.g. contractual obligations, such as data protection clauses) to ensure the overseas recipient does not breach the EU GDPR or APP.
Practically, this means having a binding contractual term that requires the recipient to adhere to the EU GDPR or APP.
Under the APP, for example, an offshore disclosure includes revealing personal information at an international conference or publishing personal data that is accessible by an overseas recipient.
Without such safeguards, the organisation transferring the information offshore will also be liable for any breaches of the privacy law. For the EU GDPR, the transfer in itself will be considered a breach unless safeguards are taken.
The practical implications of this are:
- When using a cloud service provider (e.g. AWS, Azure) - where are you keeping personal data?
- If you are using cloud Software-as-a-service (SaaS) providers, such as DataRobot for automated machine learning, where is the data being processed?
- Are these organisations compliant with the EU GDPR or APP?
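One lightweight way to keep on top of the first two questions is to keep an inventory of datasets with their hosting region and flag any personal data sitting outside the regions you have approved. The sketch below assumes such an inventory exists; the dataset names and the approved-region policy are hypothetical.

```python
# Assumed policy: personal data may only be hosted in these cloud regions.
ALLOWED_REGIONS = {"ap-southeast-2", "australiaeast"}

# Hypothetical inventory of datasets and where they are hosted.
datasets = [
    {"name": "customer_profiles", "personal": True, "region": "ap-southeast-2"},
    {"name": "web_clickstream", "personal": True, "region": "us-east-1"},
    {"name": "product_catalogue", "personal": False, "region": "us-east-1"},
]


def offshore_personal_datasets(inventory, allowed_regions):
    """Return names of personal datasets hosted outside the allowed regions."""
    return [
        d["name"]
        for d in inventory
        if d["personal"] and d["region"] not in allowed_regions
    ]


print(offshore_personal_datasets(datasets, ALLOWED_REGIONS))
```

A flagged dataset is not automatically a breach, but it is a prompt to check what contractual safeguards cover that transfer.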
Explaining automated decision-making
The increase of automation and machine learning means many decisions can now be made without human intervention, for example an automated system that determines whether to approve a credit card application.
The risk of ‘black box’ systems has led to legislation that aims to increase transparency and to prevent these systems from discriminating on the basis of personal factors (e.g. race, gender, age).
The EU GDPR (Article 22) expressly gives protection for individuals against automated processing/profiling, unless they explicitly consent to it.
Furthermore, the GDPR (Recital 71) also provides that such automated processing/profiling must not have discriminatory effects based on personal aspects of the individual.
An example is an automated recruitment process which rejects an applicant on the basis of an analysis or prediction of their performance at work, economic situation, health, personal preferences or interests, reliability or behaviour, location or movements.
The APP (APP 10 and 12) indirectly addresses discrimination by requiring entities to verify the accuracy of personal data. The flow-on effect is that entities also need to:
- ensure analytics/algorithms and automated processes are operating appropriately and not creating biased, inaccurate, discriminatory, or unjustified results
- be transparent about how analytic techniques and algorithms arrived at a decision
Furthermore, in Australia, anti-discrimination legislation (e.g. Age Discrimination Act 2004, Racial Discrimination Act 1975) exists to safeguard against scenarios where automated processing results in a biased/discriminatory outcome.
Practical Implications to Machine Learning Models
The practical implications as a data scientist are:
- Explainability of models is important - if an individual asks how the decision is made, the model should be explainable
- As part of model deployment/productionisation, you should consider including an automated explaining component. For example, if you already deploy your ML as a service (via API endpoint), you could have an additional API endpoint where you pass in a prediction ID and the explanation is returned.
- Features involving personal data should be carefully used - if you suspect adding them may generate a discriminatory effect, better to leave them out
- 'Kitchen sink' approaches to model training should be avoided - don't throw all the data you have in and train it, especially if you don't know where the data came from. In practice, strict data controls and data catalogues should prevent this approach.
- Address Over and Under Sampling - use techniques like SMOTE (available in open-source libraries such as imbalanced-learn) to address imbalanced datasets (e.g. 90% of loans rejected were people of a particular demographic).
- When automating existing processes, such as a claims process, consider checking whether there is inherent bias in the process. Otherwise the baseline and sample dataset will have bias, which will flow on to the final automated process. Conduct bias tests, such as demographic/statistical parity, as a way to check for inherent bias in the data (even if you have removed sensitive fields - e.g. race, religion). At times, sensitive fields are highly correlated to non-sensitive fields (e.g. the majority of an ethnic group live in a particular suburb), which results in 'unaware' bias.
- Use open-source libraries such as SHAP, IBM AI Fairness 360 and other Explainable AI (XAI) and Bias detection techniques/libraries - they can assist with explaining both the model in general, as well as every prediction. It also makes it easier to explain the model to internal stakeholders.
- Finding the underlying causation - your target variable may be highly correlated with a personal factor, but the underlying causation may be with something else entirely (e.g. chicken consumption and number of cars may be highly correlated, but only because they are both correlated to the overall strength of the economy)
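As an example of the demographic/statistical parity test mentioned above, the sketch below compares approval rates between groups using plain Python. The groups and decisions are made up, and what size of gap counts as a problem is a judgement call that I have not tried to encode.

```python
from collections import defaultdict


def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> approval rate per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / totals[group] for group in totals}


def demographic_parity_difference(decisions):
    """Gap between the highest and lowest group approval rates (0 means parity)."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())


# Hypothetical historical decisions: (demographic group, loan approved?)
sample = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

print(approval_rates(sample))
print(demographic_parity_difference(sample))
```

Running this on historical decisions before training gives a baseline, so you can tell whether the model inherited a gap from the data or introduced a new one.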
Closing Thoughts
It has been an interesting exercise looking at data work through the lens of privacy laws. It is sometimes too easy to go down the proverbial 'rabbit hole' in data experiments and not realise the legal implications of such experiments.
Hopefully this blog gives you a little bit more insight about how privacy laws relate to data work.