Data Science/Analytics Pitfalls
Following on from my prior blog post on common data science/analytics pitfalls, we'll now focus on pitfalls more specific to machine learning!
Note that while focused on ML, these pitfalls are equally applicable to statistical/data science work in general.
Heads up: this blog post is intended to be used as a 'mini-wikipedia' for future reference (hence the length). Feel free to use the table of contents to skip to the section that interests you.
- Data Science/Analytics Pitfalls
- Not formulating your question correctly
- Golden Hammer Bias
- Not evaluating your model properly
- Not using or developing a baseline model
- Not doing proper out-sample testing
- Not factoring in the Bias-Variance Trade-off
- Not using percentages correctly
- Data leakage - the model is cheating
- Not having a failsafe or automated validation layer for ML/AI
- Using machine learning for non-stationary environments
- Using biased datasets/processes to train machine learning models
- Closing Thoughts
Not formulating your question correctly
Data science/analytics is about addressing real-world business problems - it's easy to get caught up in all the data and 'magic' and lose sight of this.
Determine which type of data work this is
Before you begin any form of data work, you need to determine which type of outcome you are trying to achieve. IBM has identified 5 types of analytics:
- Planning Analytics - What is our plan?
- Descriptive Analytics - What happened?
- Diagnostic Analytics - Why did it happen?
- Predictive Analytics - What will happen next?
- Prescriptive Analytics - What should be done about it?
Descriptive analytics involves insight into the past, analysing historical data. It is about what has happened, not about drawing inferences/predictions. IBM considers descriptive analytics to encompass 5 stages:

1. Determining your business metrics/KPIs - e.g. improving operational efficiency or increasing revenue
2. Identifying and cataloguing the data required - whether from databases, reports, spreadsheets, etc.
3. Collecting and preparing the data - cleaning, transformation, etc.
4. Analysing the data - including summary statistics, clustering, etc.
5. Visualising/presenting the data - charts and graphs to visualise the findings
Diagnostic Analytics further extends descriptive analytics, assessing the descriptive datasets and drilling down to isolate the root cause of an issue.
Predictive Analytics, on the other hand, focuses on predictive modelling/forecasting. It is about predicting possible future events and their probability.
Examples of predictive analytics include:
- Forecasting future demand to improve the supply chain
- Assessing risk and detecting fraud
- Forecasting future growth areas to focus marketing and sales strategies
Prescriptive Analytics, finally, builds upon descriptive, diagnostic and predictive analytics to identify the best course of action. It involves analysing the predicted outcomes and determining the impact of those outcomes.
For example, in the energy industry:

- Descriptive analytics shows the last 5 years have seen a 300% increase in electricity demand, which has strained the existing electrical grid to 90% of capacity.
- Predictive analytics using machine learning forecasts that demand will jump by another 200% in the next 2 years.
- Prescriptive analytics can identify which particular assets/infrastructure need to be upgraded to meet this forecasted demand.
Knowing which type of analytics you are trying to achieve will frame many questions and starting points, including your hypothesis (as stated below).
Your hypothesis is your starting point
The concept of a hypothesis always starts with formulating two things:
- What is the default position if the question isn't answered (i.e. the null hypothesis)
- What is the alternative position if the question is answered (i.e. the alternative hypothesis)
So you can see two scenarios in which the data question is very different:
Scenario A
- Default Position - we don't sell cupcakes
- Alternative Position - we sell cupcakes if it is profitable
Scenario B
- Default Position - we keep selling cupcakes
- Alternative Position - we stop selling cupcakes if it is not profitable
You can see that how you frame the question will totally change how you approach it. Proving that selling cupcakes is profitable is a different exercise to showing that selling cupcakes is not profitable (e.g. if you don't sell any cupcakes, does that count as profitable?).
Another example is if you framed a question like this: everyone is guilty until proven innocent. Even if you come up with the best data science model, people are going to wonder how on earth you came up with such a crazy question to begin with!
Golden Hammer Bias
The ‘golden hammer’ is a cognitive bias where a person excessively favours a specific tool/process to perform many different functions.
A famous quote attached to this theory summarises it nicely:
If the only tool you have is a hammer, every problem you encounter is going to look like a nail
The golden hammer bias can lead to a 'one-size-fits-all' model or approach to a data issue. In statistics/ML this is also captured by the 'No Free Lunch Theorem': there isn't a single algorithm that is best suited for all possible scenarios and datasets.
Not evaluating your model properly
Every data science model needs to have a method to measure how well it is doing. The way to measure this is via evaluation metrics. The common types of evaluation metrics include:
Regression Problems - predicting an actual value (e.g. tomorrow's temperature):

- Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) - essentially measure the average error of the model. They are measurements of error, so the lower the better. Due to the squaring in its calculation, RMSE penalises bigger errors more than MAE.
- Mean Absolute Percentage Error (MAPE) - the average % error of the model (the lower the better). Note that RMSE, MAE and MAPE don't show the direction of errors (over or under).
- Adjusted R-squared score - essentially measures the correlation between the predicted values and the actual values. It is a measurement of goodness of fit, so the higher the better.
- Akaike Information Criterion (AIC) - essentially measures how well the model performed, factoring in how complicated the model is. The lower the AIC, the more generalised (i.e. less likely overfitted) and better the model. This metric is mainly used for time series forecasting, where there often isn't enough training data to otherwise compare which model performed better.
For classification problems, the possible outcomes are expressed using a confusion matrix:

| | Actually has cancer | Actually does not have cancer |
| --- | --- | --- |
| Predicted has cancer | True Positive | False Positive |
| Predicted does not have cancer | False Negative | True Negative |
Using the above matrix, you can see there are multiple ways to evaluate a classification problem:
- Accuracy - how many outcomes (positive and negative) the model predicted correctly
- Precision - of everything the model predicted as positive, how many actually were (factors in true and false positives)
- Recall/Sensitivity (i.e. True Positive Rate) - of everything actually positive, how many the model caught (factors in false negatives)
- Specificity (i.e. True Negative Rate) - of everything actually negative, how many the model correctly ruled out (factors in false positives)
- F1 Score - the harmonic mean of precision and recall, combining the above
The relationship between them all is defined as:

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + False Negative + True Negative)

Precision = True Positive / (True Positive + False Positive)

Recall/Sensitivity = True Positive / (True Positive + False Negative)

Specificity = True Negative / (True Negative + False Positive)

False Positive Rate = 1 - Specificity

F1 Score = 2 / ((1 / Precision) + (1 / Recall))
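To make these concrete, here's a quick sketch computing the formulas above from raw confusion-matrix counts (the counts are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts for a 1,000-case test set
tp, fp, fn, tn = 80, 10, 20, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)               # a.k.a. sensitivity / true positive rate
specificity = tn / (tn + fp)
false_positive_rate = 1 - specificity
f1 = 2 / ((1 / precision) + (1 / recall))  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note how accuracy looks great (0.97) while recall is a noticeably weaker 0.80 - exactly the imbalance issue discussed next.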
When to pick what evaluation metric
Which evaluation metric you select affects what you are optimising for. A common type of classification problem involves an 'imbalance' between the number of positives vs negatives. For example, 99.999% of the population will not have a particular cancer and only 0.001% does.
When you have an imbalanced problem like this, accuracy isn't a suitable evaluation metric. That's because if I needed to predict whether a person has cancer in the above scenario, I could always say 'no cancer' and be right 99.999% of the time.
In these instances, you can't use accuracy as a measurement. You don't want your predictive model to produce too many false negatives or false positives, but you may consider false negatives to be much worse - that is, someone who had cancer but the model said they didn't.
In these instances, an appropriate method would be using Recall, which factors in how many false negatives the model has.
- Use Precision if you care more about false positives - e.g. spam detection (you don't want legitimate emails flagged as spam)
- Use Recall if you care more about false negatives than false positives - e.g. cancer prediction
- Use Specificity if you care more about correctly identifying true negatives - e.g. predicting violent crimes
- Use the F1 Score if you have an imbalanced dataset
- Use the AUC-ROC score if you have a more balanced dataset (explained below)
It is sometimes a trade-off (illustrated by the AUC-ROC curve)
In an imperfect world, true vs false positives/negatives involve a trade-off:
- If you predict no one has cancer, you get 0% false positives but 100% false negatives
- If you predict everyone has cancer, you get 100% false positives but 0% false negatives
Some algorithms allow you to tweak the probability threshold/cut-off at which you consider something to be true or not.
For example, if you took a 50% threshold (the default position), patient 123 below is a false negative and patient 9725 a false positive:
50% Probability Threshold

| Patient | Probability of Cancer | Model's Prediction (50% threshold) | Actually has cancer |
| --- | --- | --- | --- |
| 784 | 10% | N | N |
| 123 | 30% | N | Y |
| 585 | 40% | N | N |
| 1156 | 50% | Y | Y |
| 9725 | 55% | Y | N |
| 29823 | 90% | Y | Y |
For example, given the consequences of a false negative are much higher than those of a false positive, you might set the threshold at 30% (rather than 50%). This will reduce the number of false negatives, but it will increase the number of false positives:
30% Probability Threshold

| Patient | Probability of Cancer | Model's Prediction (30% threshold) | Actually has cancer |
| --- | --- | --- | --- |
| 784 | 10% | N | N |
| 123 | 30% | Y | Y |
| 585 | 40% | Y | N |
| 1156 | 50% | Y | Y |
| 9725 | 55% | Y | N |
| 29823 | 90% | Y | Y |
If you then took a more extreme approach and set the threshold to 10%, you would get basically no false negatives, but a lot of false positives:
| Patient | Probability of Cancer | Model's Prediction (10% threshold) | Actually has cancer |
| --- | --- | --- | --- |
| 784 | 10% | Y | N |
| 123 | 30% | Y | Y |
| 585 | 40% | Y | N |
| 1156 | 50% | Y | Y |
| 9725 | 55% | Y | N |
| 29823 | 90% | Y | Y |
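The three tables above can be reproduced with a short sketch (the patient probabilities are copied from the tables; `confusion_at` is a hypothetical helper, not a library function):

```python
# (probability of cancer, actually has cancer) per patient, from the tables above
patients = {784: (0.10, False), 123: (0.30, True), 585: (0.40, False),
            1156: (0.50, True), 9725: (0.55, False), 29823: (0.90, True)}

def confusion_at(threshold):
    """Count (false negatives, false positives) at a given probability cut-off."""
    fn = sum(1 for p, actual in patients.values() if p < threshold and actual)
    fp = sum(1 for p, actual in patients.values() if p >= threshold and not actual)
    return fn, fp

for t in (0.50, 0.30, 0.10):
    print(t, confusion_at(t))  # 0.5 -> (1, 1), 0.3 -> (0, 2), 0.1 -> (0, 3)
```

Lowering the threshold trades false negatives for false positives, exactly as the tables show.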
The curve that plots the false positive rate against the true positive rate at various probability thresholds is known as the Receiver Operating Characteristic (ROC) curve:
ROC Curve = False Positive Rate (x-axis) vs. True Positive Rate (y-axis)
We then calculate the Area Under this Curve (AUC) to get a percentage. The baseline is 50%, which corresponds to random guessing (you'd expect to get 50% wrong and 50% right). Therefore, any model with an AUC-ROC score below 50% is considered no good, as random guessing would fare better.
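For intuition, AUC can also be computed directly from its probabilistic interpretation: the chance that a randomly chosen actual positive is scored higher than a randomly chosen actual negative. A minimal sketch with made-up model scores:

```python
def auc(scores_pos, scores_neg):
    """Probability a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for actual positives vs actual negatives
positives = [0.30, 0.50, 0.90]
negatives = [0.10, 0.40, 0.55]
print(auc(positives, negatives))  # 6 of 9 pairs ranked correctly -> ~0.67
```

A perfect ranker scores 1.0, and a coin flip scores 0.5 - the 50% baseline mentioned above.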
Knowing what you are measuring and what your ultimate goals are is crucially important in building a good model!
Percentage of error is not always appropriate
Another example is not using Mean Absolute Percentage Error (MAPE) properly. Because a forecast can end up being way off, the percentage error is unbounded - your model could be 1,000,000% inaccurate!
Furthermore, over-forecasts are more heavily 'penalised', as the maximum error for an under-forecast is 100%. To illustrate, we'll do a simple calculation:
- Actual value is $1,000
- Forecast value is $1,200
- Therefore your % error = ABS($1,200 - $1,000) / $1,000 = 20%
Now take the same scenario, but with more extreme numbers:
- Actual value is $1,000
- Forecast value is $120,000
- Therefore your % error = ABS($120,000 - $1,000) / $1,000 = 11,900%!
But if you under forecasted by a lot, there’s a cap on the error:
- Actual value is $1,000
- Forecast value is $3
- Therefore your % error = ABS($3 - $1,000) / $1,000 = 99.7% (your % error can never go above 100%)
Furthermore, if your actual value is $0, then you obviously can't divide by zero, so you can't even calculate a MAPE!
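The calculations above can be sketched in a few lines (the `mape` helper here is hypothetical and handles a single data point, for illustration):

```python
def mape(actual, forecast):
    """Absolute percentage error for a single point, as a percentage."""
    if actual == 0:
        raise ValueError("MAPE is undefined when the actual value is 0")
    return abs(forecast - actual) / abs(actual) * 100

print(mape(1_000, 1_200))    # modest over-forecast: 20%
print(mape(1_000, 120_000))  # extreme over-forecast: 11,900%
print(mape(1_000, 3))        # extreme under-forecast: capped below 100%
```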
So picking the right evaluation metric is important: if you can't reliably measure how well your model is doing, you can't improve it (or prove it actually works).
Not using or developing a baseline model
A common mistake for people who are starting out in data science is jumping straight into the problem, building a machine learning model and calling it a day!
Ultimately the model needs to provide value-add, whether through more accurate predictions, time saved, increased capability, etc. To demonstrate this, however, you need a baseline.
A baseline model is a model that does a prediction using naive/simple techniques. It is analogous to comparing your model to a random person guessing. It is important to note that you should not use a baseline as a final model. It is merely to illustrate/prove that your model can at least beat a baseline.
It is surprisingly hard to beat a decent baseline model - for example, in forecasting, naive forecast methods often beat many types of machine learning models. You can see an example in my previous blog post on electricity forecasting.
Baselines are also important when you have imbalanced datasets - such as banking fraud prediction. If 99.9% of transactions are not fraudulent, blindly guessing 'not fraudulent' for every single transaction will yield 99.9% accuracy. Your prediction model should at least beat this.
Naive Baseline Models
These are called ‘naive’ because they apply a very simple rule to predict a value.
For regression problems, these include:

- Persistence model (for time series forecasting) - e.g. a one-year persistence model takes 1 August 2019's value to predict 1 August 2020's value
- Zero Rule Algorithm using the mean, median or most frequent value - i.e. predicting the future value using the mean, median or most frequent value in the existing dataset. These are known as 'Zero Rule' because of their simplicity: they predict the same value for every scenario (i.e. no rules).

For classification problems, these include:

- Zero Rule Algorithm using the median or most frequent value - i.e. predicting the future value using the median or most frequent value in the existing dataset
- Dummy Estimator using stratified predictions - i.e. it guesses based on how frequently a category appears. For example, if 30% of the customers in a dataset are 'low risk', it will guess 'low risk' 30% of the time.
- One Rule Classifier - you select the feature that is the best estimator for the target and use that as the 'rule'. For example, in a customer sales dataset, 'one rules' to predict whether a customer buys a product could be:
  - If Premium_Customer_YN = 'Y' → Predict 'Y'
  - If Premium_Customer_YN = 'N' → Predict 'N'
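As a sketch, a zero-rule classifier is only a few lines (the fraud labels below are made up for illustration):

```python
from collections import Counter

def zero_rule_classifier(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common  # ignores the features entirely

# Hypothetical imbalanced dataset: 9 legitimate transactions, 1 fraud
labels = ["not_fraud"] * 9 + ["fraud"]
predict = zero_rule_classifier(labels)
print(predict({"amount": 120}))  # always "not_fraud" -> 90% accuracy here
```

Scikit-Learn's `DummyClassifier` and `DummyRegressor` provide the same idea (including the stratified strategy) off the shelf.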
Simple Model as a baseline
For the purposes of establishing a baseline, there are a few algorithms that provide a good starting point to improve on.
The purpose of using these models isn't to get a final model, but rather to have something to compare against and demonstrate that any further ML work adds value. For classification problems, these include:

- Logistic Regression model - like linear regression, logistic regression can be 'universally' applied to a classification problem to get a decent baseline model. Furthermore, logistic regression allows you to tweak the probability threshold - e.g. if you set the threshold at 80% for predicting fraud (Yes vs No), the model will only say 'Yes' if there's at least an 80% probability of fraud. Setting the threshold to 0% or 100% effectively gives you a dummy estimator.

For regression problems, common types of baseline models include:

- ElasticNet linear regression model (L1 + L2 regularisation) - linear regression is conventionally considered a baseline model because of how 'universally' it can be applied to a regression problem. For the purposes of baselining, you won't need to worry about the linear regression assumptions, as you won't end up relying on this for your final model.
Importantly, keep the features simple. Simple features create a simple model, resulting in a baseline (and baseline metrics) that can be used to evaluate your subsequent, more complex models.
Rulesbased baseline models
Another form of baseline is a rules-based approach (a.k.a. heuristics-based modelling) that reflects the current business processes.
Many of these rules may be derived from years of business experience and practices already adopted. This makes sense, given the goal of a machine learning model is to improve existing processes.
I discuss hybrids and heuristics further in another blog post.
The end result may be that a hybrid model of 70% rules and 30% machine learning outperforms a 100% machine learning model. After all, the focus is on the model that performs best, not the one that uses the fanciest technology/algorithm.
Not doing proper out-sample testing
In general, when you develop a model (whether machine learning or rules-based), you need to split the data into a training dataset and an evaluation/testing dataset. The idea is that your machine learning model should never see the evaluation/testing dataset while it's training.
A simple split method is randomly selecting 70% of the data for training and 30% for testing. The reason you pick a random sample is to avoid data skews (e.g. picking all the customers from a certain address).
More comprehensive methods such as K-fold cross validation can further improve this splitting - it subdivides the dataset K times (e.g. 10 times) and, for each subdivision, does a further train-test split.
Scikit-Learn's documentation has a great visualisation of how this works.
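A minimal 70/30 random split can be sketched as below - in practice you'd reach for Scikit-Learn's `train_test_split` and `KFold`, but this shows the idea (the fixed seed makes the split reproducible):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle rows with a fixed seed, then split into train/test."""
    shuffled = rows[:]                       # copy so the original is untouched
    random.Random(seed).shuffle(shuffled)    # random sample -> avoids skews
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 70 30
```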
Out-sample testing is when you test your model using data that you didn't use to train it. You do this to prove that your model can 'truly' predict things it has never seen before. It is analogous to exams for humans - you do your study, and then you must prove yourself on questions you've never seen before.
There are a few do-nots:

- Don't abuse out-of-sample data by building a model, testing it on the out-of-sample data and constantly going back and forth. What you end up with is basically treating the out-of-sample data like your training data.
- Don't take 'inspiration' or 'trends' from your out-of-sample dataset and then test them on the same dataset. If you have a 'trend' you want to test, do it on a dataset you've never seen before.
- Don't use out-of-sample data that isn't representative of the real world or has skews/biases in it.
- Don't let your out-of-sample data overlap with your training dataset - otherwise it isn't really out-of-sample.
If you can't reliably test how well your model performs on out-of-sample data, you can't figure out how well it really performs!
Not factoring in the Bias-Variance Trade-off
When you create a machine learning model, there are two competing goals that constantly battle it out:
1. You want your model to be as simple as possible
2. You want your model to cover as many scenarios as possible
If your model does number 1 very well, it is biased and underfitting, because it ignores a lot of the dataset's characteristics.
If your model does number 2 very well, it has high variance and is overfitting, as it tries to cover every single scenario possible.
You can’t have it both ways!
When training a model, there will be a point where the model stops generalising and starts learning the noise in the dataset. The more noise it learns, the less generalised it is and the worse it is at making predictions on unseen data.
A classic example of a model learning the wrong features is a classifier for dogs vs wolves. The algorithm was given images of dogs and wolves to train on. However, the training images always had dogs in grass and wolves in snow. Therefore, rather than learning the general features of dogs vs wolves, it classified anything in snow as a wolf and anything in grass as a dog!
Being too specific to the training data is just as problematic - a common analogy is a student studying for an exam. The student can:
1. Study the material and do the practice exam questions
2. Remember all the practice exam questions and answers off by heart
Now if the student sat the exam and 100% of the questions were the practice exam questions, then scenario 2 would score 100%! But the real world doesn't work like that, so you would expect scenario 1 to give better outcomes, because the student has 'generalised' their knowledge to cover more scenarios.
There is a constant bias-variance trade-off - the more it goes one way, the less it goes the other. On the one hand you want your machine learning model to perform very well on the training data (i.e. the 'practice exam'), but if the model scores 100%, that suggests overfitting (i.e. it's only good at answering the practice exam, not questions in general).
Overfitting is much more common than underfitting, because when you train a model, you're always chasing the best possible metric! You keep tweaking until you get 90%, 95% and then 100% accuracy. Wait, hold on - that's a sign of overfitting!
You then run the model on out-of-sample data and it's doing 30% accuracy.
A great example and parody of overfitting by XKCD:
Not using percentages correctly
Percentages are a great way of describing data  it is a type of reference that many people are comfortable with. Saying 10% of sales or 50% of people is easy to understand.
However, because percentages are not absolute numbers, there’s potential for manipulation and confusion.
I’ve noted some tips below.
Don’t use percentages when the values can be negative
When values can be negative, percentages won't make sense. For example, say you are measuring the increase/decrease in robberies in a city for a particular year:

| Suburb | Increase/Decrease in # of Robberies |
| --- | --- |
| Suburb A | +3,000 |
| Suburb B | -2,000 |
| Suburb C | +5,000 |
| Suburb D | -4,000 |
| Suburb E | +3,500 |
| Total Net Increase | +5,500 |
Taking the above, you could calculate that Suburb A contributed 3,000 of the 5,500 net increase in robberies, and therefore '54.5% of the robbery increase came from Suburb A'!
But that’s very misleading, because if you take the same logic to the other suburbs:
- Suburb A contributed 54.5%
- Suburb C contributed 90.9%
- Suburb E contributed 63.6%
Not a very useful metric!
Don’t use percentages when the total of the values can be over 100%
For example, the famous Colgate ad claiming '4 out of 5 dentists recommend Colgate' was deemed misleading by the UK Advertising Standards Authority.
What actually happened was dentists were asked which toothpastes they would recommend. 80% (or 4 out of 5) of the dentists provided a list, and Colgate appeared in 80% of those lists. However, a list may have looked like:
- Colgate
- Oral B
- Sensodyne
Therefore, by Colgate's definition, you could get something absurd like: 80% of dentists recommend Colgate, 100% recommend Oral B and 70% recommend Sensodyne. That gives a totally different (and somewhat confusing) meaning!
Another way to think of it is: if I can’t visualise it in a pie chart, then it doesn’t make sense to use percentages.
Comparing percentages with different bases
When you represent trends and growth as percentages, it helps illustrate the point, but it can sometimes be misleading. After all, 80% of $10 is very different to 80% of $1 million.
An example: say the chance of cancer from taking a drug is 0.00010%. If you then take a new version of the drug, your chances go up to 0.00018%. Technically speaking, that is an 80% increase, but saying the drug increases your chance of getting cancer by 80% seems misleading. After all, the actual number is still very small.
A less misleading way would be to say: your chances of getting cancer go from 0.00010% up to 0.00018%, an increase of 0.00008 percentage points.
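The drug example can be sketched in a few lines, showing how different the relative and absolute figures look:

```python
old_risk = 0.00010  # % chance of cancer on the old drug
new_risk = 0.00018  # % chance on the new version

relative_increase = (new_risk - old_risk) / old_risk * 100  # the scary "80%"
absolute_increase = new_risk - old_risk                     # percentage points

print(f"relative: {relative_increase:.0f}%")
print(f"absolute: {absolute_increase:.5f} percentage points")
```

Same underlying change, two wildly different headlines - always report the base alongside the percentage.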
XKCD sums it up greatly:
Data leakage - the model is cheating
A key principle in preparing data for machine learning is that you split the data into at least two parts (or more if you do cross validation):
- a training dataset, which contains the 'questions' (features) and the 'answers' (target)
- a test dataset, which should simulate real-world, yet-to-be-seen data. The model should never see the 'answers' (i.e. the target) in this dataset.
Data leakage is when the training data for a machine learning model contains information that wouldn't be available in a real prediction scenario. That is, the model is 'cheating', which produces a very good-looking model during training, but a not-so-good (or even rubbish) model in actual production.
Models are usually designed to predict the future, but are trained on historical data. Therefore, you need to be careful about what data the model would actually have available at the point of its decision/prediction. This includes when you preprocess data (e.g. imputers or scalers) - if you fit them on the whole dataset before doing the train-test split, your model has had a sneak peek at the attributes of the test dataset!
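To make the scaler point concrete, here's a sketch where the standardisation parameters are learnt from the training split only and merely applied to the test split (a hand-rolled scaler for illustration; Scikit-Learn's `StandardScaler` with `fit`/`transform` follows the same pattern):

```python
def fit_scaler(train_values):
    """Learn standardisation parameters from the training data only."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = var ** 0.5
    return lambda v: (v - mean) / std

train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]                 # unseen data

scale = fit_scaler(train)      # fitted on train only - no peek at the test set
print([scale(v) for v in train])
print([scale(v) for v in test])
```

Fitting on the full dataset instead would quietly fold the test set's mean and spread into the training features - a classic leakage bug.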
If data that shouldn’t exist at the point in time of prediction exists in the dataset and you use it for training, your results will be distorted. For example, if you want to predict the gold price for the next 5 days:
| Date | Gold Price |
| --- | --- |
| 2021-08-02 | ? |
| 2021-08-03 | ? |
| 2021-08-04 | ? |
| 2021-08-05 | ? |
| 2021-08-06 | ? |
The model at any point of prediction can only know the gold price up to that point. That is, when you train the model, you cannot do this:
| Date | Gold Price |
| --- | --- |
| 2019-08-02 | 1718.8 |
| 2019-08-03 | 1700.1 |
| 2019-08-04 | 1710.5 |
| 2019-08-05 | 1735.1 |
| 2019-08-06 | ? |
Then do this:
| Date | Gold Price |
| --- | --- |
| 2019-08-03 | 1700.1 |
| 2019-08-04 | 1710.5 |
| 2019-08-05 | 1735.1 |
| 2019-08-06 | 1720.9 |
| 2019-08-07 | ? |
The model, at the point of predicting the next 5 days, won't know the next 4 days of data. Doing this means you are effectively training the model on future data (fortune telling to the extreme)!
Another example is data that is updated after the fact. Say you are building an algorithm to predict defaults on customer loans. Your dataset comes from the company's CRM, which has a credit score field. However, you later realise that when a customer defaults, the CRM updates their credit score to 0.
You then have to exclude this feature from your algorithm - otherwise the model will find a 100% correlation between a 0 score and default. Which is true, but the 0 score is only set after the customer has defaulted.
Another good example is a model designed to predict whether a patient has pneumonia, where the dataset has a feature 'took_antibiotic_medicine' that is changed after the fact (i.e. you take the antibiotics after you get pneumonia).
Therefore, using this feature to predict whether someone has pneumonia makes for a lousy model - in real life, a patient doesn't get antibiotics until they are diagnosed with pneumonia.
To avoid this, you should try to:

- Have a holdout dataset - data that is completely unused during the entire ML process and only used at the very end to evaluate the model
- Always do the train-test split before you do preprocessing
- Have a cut-off date - this is very useful for time series data; you specify a date after which all data is ignored
- Ensure duplicates don't exist - this avoids the same data being in both the training and test datasets
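Two of these checks (the cut-off date and de-duplication) are cheap to automate - a sketch with made-up records:

```python
from datetime import date

records = [
    {"id": 1, "date": date(2021, 7, 30), "price": 1718.8},
    {"id": 2, "date": date(2021, 8, 3), "price": 1700.1},  # after the cut-off
    {"id": 1, "date": date(2021, 7, 30), "price": 1718.8},  # duplicate of id 1
]

cutoff = date(2021, 8, 1)

# Drop everything after the cut-off date, then de-duplicate on id
usable = [r for r in records if r["date"] <= cutoff]
seen, deduped = set(), []
for r in usable:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

print(len(deduped))  # 1 record survives both checks
```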
Not having a failsafe or automated validation layer for ML/AI
Occasionally, machine learning models will generate crazy results that are totally unexpected. This may be the result of incorrect data (e.g. someone actually wrote 170km for their height instead of 170cm). If these aren't picked up in preprocessing, you could end up with some absurd outcomes.
Unlike humans, ML models don't 'self-correct', so it's important for humans to build in a separate layer of automated validation or 'failsafes'. The idea is that this layer checks the model's output and applies some baseline checks to ensure the outcome is not completely unexpected.
This layer is generally rules-based and reflects the fundamental 'red lines' of the organisation/business - for example, a baseline safety check so the system never grants a home loan for $1.
Having this separate layer is important because you can't account for everything in the ML model. An ML model won't know what your fundamental red lines are, but you as a human do.
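Such a failsafe layer can be as simple as a handful of hard-coded checks on the model's output (the red lines below are hypothetical examples):

```python
def validate_loan_offer(amount, interest_rate):
    """Rules-based failsafe: flag model outputs that cross hard business red lines."""
    errors = []
    if amount < 10_000:
        errors.append("loan amount below minimum ($10,000)")
    if amount > 2_000_000:
        errors.append("loan amount above maximum ($2,000,000)")
    if not 0 < interest_rate < 0.25:
        errors.append("interest rate outside sane bounds")
    return errors  # empty list means the output passes the failsafe

print(validate_loan_offer(1, 0.05))        # the $1 home loan gets caught
print(validate_loan_offer(350_000, 0.04))  # [] -> passes
```

In production you'd route any non-empty result to a human for review rather than letting the model's output through.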
As the old saying goes, "if you fail to plan, you plan to fail."
Using machine learning for non-stationary environments
Stationary means the underlying rules generating the data haven't changed. Therefore, if the rules have changed, your environment is non-stationary.
If it's very obvious the past doesn't indicate the future, then you can't use the past to predict the future. It's like trying to predict how many people will drive tomorrow using data from 200 years ago (when there were no cars).
This error is quite easy to make. Say, for example, you need to build a churn model that predicts whether a customer calling in is likely to leave your company. If you only have data for March available and March had a frenzy sale, you can't really use that to predict April churn/sales. Why? Because the underlying rules relating to the sales have changed (there are no more frenzy sales)!
Furthermore, factors such as new customers, new products, seasonality etc. all come into play. This is why the older the historical data, the less likely it can be used for predictive modelling - there's just too much that could've changed.
Nowadays, deep learning algorithms and more advanced reinforcement learning techniques can potentially overcome this via the algorithm 'learning' the changes. However, it is important to still keep this in mind - otherwise you will see big drops in your model's performance.
Using biased datasets/processes to train machine learning models
At its core, a machine learning model learns from existing real-world data, and all the conventional wisdom on data science/analytics above says your data, processes and model should reflect the real world.
However, this is very problematic if your data reflects historical biases, discrimination and other unfair events. By extension, you can see that this will result in biased AI/ML, which has come more and more into the spotlight recently.
The European Union's General Data Protection Regulation (Recital 71) actually makes it unlawful for a person to be subject to a decision based on a discriminatory automated process!
A famous example of this was a machine learning model that Amazon created for recruiting staff. After a while, they realised the model was biased against women because, historically, recruitment had favoured men over women.
Therefore the model, trained on this historical data, ended up favouring male applicants over female ones too. The historical discrimination and bias was further entrenched!
Other highprofile examples of discriminatory AI and models include:

- Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) - an algorithm designed to predict whether a criminal defendant would be likely to reoffend, used by US courts, correctional facilities, etc. as a factor in their decision-making. Unfortunately, historical bias against black defendants in the criminal justice system was entrenched in the data, and the algorithm was arguably biased against black defendants.
- Google Photos categorisation algorithm - the algorithm categorised images in Google Photos with labels such as 'dog', 'car', etc. It was discovered that there was a significant lack of photos of dark-skinned people in the training data, and the algorithm therefore erroneously labelled some dark-skinned people as 'gorillas'. Oh dear!
Unfortunately, addressing this issue almost contradicts all the points above: essentially, your model has to deliberately not reflect historical reality in order to not be biased. In a way, if you applied such a model to historical data, it should in theory look inaccurate, because it doesn't reproduce the biased results the historical data contains.
Google AI has outlined some Responsible AI Practices. I've summarised a few high-level tips:

- Ensure you use enough test data that is inclusive/fair - for example, including a significant enough portion of cultural, racial, gender and religious diversity
- Ensure you use test data that reflects future real-world use - for example, if you are going to deploy a speech recognition algorithm globally, test it on people with different accents
- Have people play 'devil's advocate' and adversarially test the system with edge cases to see how well it performs
As we become more and more reliant on automation and AI/ML, the effect of bias in these models becomes a bigger issue.
Closing Thoughts
So this wraps up my two-part blog on common data science/analytics pitfalls.
Hope you learnt something new!