The Machine Learning Minefield - How to Avoid Getting Hit by Machine Learning Poisoning

Data corruption that destroys machine learning systems

Posted by Albert Cheng on 22 March 2022

Last Updated: 22 March 2022

Privacy is becoming more important

Artificial intelligence and machine learning (AI/ML) is becoming even more prevalent in day-to-day life. We sometimes even place our lives in their hands! Think about the systems that manage our electrical grid, ambulance operator dispatch, cybersecurity and more.

Unfortunately, when there’s something new, there’s always someone that wants to break it. One such method to attack AI/ML is known as machine learning poisoning, which deliberately tries to corrupt the AI/ML system so they become either ineffective or inaccurate.

In this blog, I will explore the concept of machine learning poisoning and how it is a very dangerous form of attack!

Binary World Map Photo by GDJ from Pixabay

How do you poison a machine learning system?

Machine learning systems are built by training - that is, an algorithm is fed with data that has examples of what the result should look like. For example, training a ML system for predicting weather would involve giving it examples of what the data would like when it rains, when it snows, etc.

How well a ML system performs is heavily dependent on its training dataset - if you give a ML system bad or incomplete data, it can become biased. I covered this concept quite a bit in my previous blog post.

Therefore, ML poisoning exploits this vulnerability by deliberately corrupting the data, causing the ML to behave in unintended ways. This is also known as logic corruption.

The data poisoning can be done against either the training data or the data input for prediction.

When it is done against training data, the aim is to inject enough data so the ML is trained to provide incorrect/skewered results. It is analogous to the naughty older sibling teaching the younger sibling bad habits. For example, if you inject malicious data points that deliberately mis-classify a human face as a donkey, the ML model can get corrupted and start mis-classifying human faces as donkeys.

Distorted Painting Photo by GDJ from Pixabay

When it is done against the input data, the aim is to inject enough irregular/unexpected data points into the input. For example, as I will discuss below, adding strange pixels to an image can cause an image recognition model to mislabel a polar bear as a washing machine.

Poisoning the Training Data

There are two main forms of ML poisoning attacks:

  • Availability attacks - this aims to render the entire ML system ineffective or useless by injecting as much poisonous data as possible.

  • Integrity attacks - these are surgical attacks that create backdoors, allowing a particular type of input to trigger a specific result from the ML system. For example, injecting malicious data so whenever a username is called ‘iamahacker’, it will never blacklist their traffic.

It is amazing how easily even a sophisticated AI/ML can be corrupted just with some poisoned data. If data is the lifeblood of ML systems, then having poisoned blood will definitely cause issues!

There has been evidence to suggest that poisoning just 3% of a training dataset reduces the accuracy by 11%.

A commonly cited example of a ML poisoning attack is against Google’s Gmail spam filter. Many attempts were made to corrupt this filter at a very large-scale - over millions of spam emails were sent and automated bots would flag them as ‘not spam’. The idea was to poison the spam filter ML training data, so when the Gmail spam filter ML is re-trained, it will erroneously categorise those emails as ‘not spam’.

Fortunately, Google came up with ways to defend against these attacks, which you can read more about here from the head of Google’s anti-abuse Research Team.

Another example: a few years ago, Microsoft increased an AI chatbot, Tey, to learn from people’s tweets on Twitter. Before long, people were maliciously tweeting racist and misogynistic and within 24 hours, Tey was making some very disturbing and racist statements! You can read about it here.

The Matrix Photo by geralt from Pixabay

A more sophisticated method of attacking a ML system also involves building a clone of the target ML system - which I will discuss below.

Building out a clone of the ML system to attack

For the majority of attacks, the attacker will not have detailed and intricate knowledge of the ML system. ML systems are generally well-guarded secrets by organisations, as exposing their inner workings is both a security and potentially data privacy risk. In cases when the attacker has no knowledge of the system, it is known as a black bot attack.

However, rather than taking a stab in the dark, attackers can use grey box attacks.

In a grey box attack, the attackers will build out a surrogate model that will mimic the behaviour of the ML system. This involves invoking the ML system to make predictions on a wide-range of data inputs and observing/recording its output. The inputs and outputs are then used to train a mimic/clone ML model (i.e. the surrogate model).

I discussed above surrogate models in an earlier blog post, albeit in the context of helping explain how a ML models. Ironically, such a method is also great to attack said ML model too.

Using the surrogate model offline (i.e. not involving the target ML system), the attacker will deliberately inject malicious data into the surrogate model to see whether it can change its behaviour. Once they have ‘proven’ out a logic corruption, they will then attempt a large-scale attack on the target ML system using what they’ve learnt from the surrogate model.

Also ironically, the target ML system can also defend against such a grey box attack by counter-poisoning an attacker’s surrogate model. This is done by deliberately providing bad predictions to confuse any system that is observing the target ML system. For example, deliberately classifying certain emails as spam (they are in fact legitimate) for no reason other than to throw an attacker off on what is actually ‘spam’.

Clones Clones Photo by cottonbro from Pexels

Adversarial Images/Perturbations

Adversarial images/perturbations are a form of input data corruption that aims to purposely confuse an image recognition ML model. In this case, perturbation means adding enough noise to an image that is invisible to the human eye but just different enough to confuse a ML system.

The aim is to purposely cause image recognition ML to make incorrect predictions. The noise is deliberately tailored to the specific type of image recognition ML.

Below is an example from a research paper which shows what happens when noise is added to an image of a panda.

Clones Image from Explaining and Harnessing Adversarial Examples (2015)

On the left, the image recognition ML classified the original image of a panda with 57.7% confidence. Once the noise was added, the ML model was 99.3% confident it was a gibbon. However, importantly, note that the human eye can’t see the noise pixels in the image on the right.

In the above case, the noise pixels deliberately target the neural network’s method of reading pixels - that is, it is tailored form of malicious bad data designed to defeat a certain type of image recognition model.

Another example is self-driving cars and stop signs. Again from another research paper,, it shows an example of two stop signs. To the human eye, both seem like ordinary vandalism. However, these images are actually deliberate image perturbations which target image classifiers for self-driving cars.

Clones Image from Robust Physical-World Attacks on Deep Learning Models (2017)

The result is self-driving cars started to classify this as a speed limit sign, not a stop sign!

The implications of this are very serious - an attacker can deliberately cause car accidents just by corrupting stop signs! They could also fool security cameras and smuggle in weapons, or confuse facial recognition systems!

Now we’ve discussed how serious poisoning attacks are, next we will cover how to counter them.

So how do we detect ML poisoning attacks?

The easiest way to prevent data corruption is to detect it before it is used to train a ML model. In a way, data poisoning is a special case of data drift, which is a common occurrence for any ML system.

I discussed more about data drift in my previous blog post. In short, it occurs when the training data begins to drift or gets bad for a variety of reasons. Common example includes changes to the source input system (e.g. contract dates use a different format) or a change to the definition of a data field.

Therefore, using techniques to counter data drift, we can also prevent and detect ML poisoning by:

  • Data quality tests on the training data and input data - e.g. a person’s income shouldn’t be < 0

  • Outlier/anomaly detection on the training data - e.g. a small group of IP addresses shouldn’t account for majority of logins

  • Modifying/denoising the training data (specific for neural networks) - e.g. cleaning and removing the adversarial noise (e.g. autoencoders)

  • Creating a separate detection ML model whose role is to detect whether an input has been tampered with.

  • Differential testing on version of the ML model - using a static dataset as input, the output of the previous vs current model is compared. If there is a difference that is unexplained (or there’s a significant degradation), it indicates that the training dataset has drifted significantly (or there’s a significant amount of poisoned data) and ML model is unable to handle such data.

These techniques provide a natural defence against ML poisoning by handling poisoned data as part of handling data drift. In the final section, I will discuss specific techniques to combat against data poisoning.

Shield Image from Pixabay

How about specific data poisoning defences?

In addition to the general techniques to counter against data drift, there are specific techniques used to counter against data poisoning. I will discuss two of the more common ones:

  • Adversarial Training

  • NULL Classes

Adversarial Training

A commonly discussed method to defend against data poisoning, in particular adversarial perturbations, is adversarial training. This basically is when ML models are re-trained with adversarial examples (i.e. images with noise in it), but the correct label/result is set.

Similiar to a vaccine, because the ML algorithm has already seen this type of noise, it will develop a defence against it. In the above example, the model would be retrained with the corrupt stop sign that’s correctly labelled so it wouldn’t misclassify it as a speed limit sign.

However, like a vaccine, attacks can mutate and adapt to the ML model to introduce new adversarial perturbations. While it is possible to keep re-training the model with more adversarial examples, eventually there will reach a point where this isn’t feasible.

That is, the training dataset has so much fake data that it no longer can do its original job properly - classify the majority of legit data.

NULL class

The final defence I will mention is the NULL class method. This is specific to classifers and mainly for neural networks.

In general, classifiers have a limited set of options it can select from when making a prediction. For example, in a dog vs cat classifier, an algorithm will either say it is a ‘dog’ or it is a ‘cat’.

It may assign a probabilistic weighting to each prediction (e.g. 70% chance it is a dog, 30% chance it is a cat), but ultimately the model will have to make a decision on a class.

The vulnerability of such classifiers sees unexpected or ‘out of bounds’ data - e.g. if the dog vs cat classifier sees an image of a dog wearing a cat costume. Forcing it to decide from a pre-set list of options may result in some strange outcomes.

Therefore, a NULL class is effectively an extra option that is given to classifiers so when it is ‘unsure’, it can abstain from deciding by classifying it as NULL.

Practically this is often done by a novelty detection component of your ML system. Novelty detection is similiar to outlier detection, but rather than doing it on the training dataset, it is done on the actual inputs of the ML system.

Like outlier detection, the goal of this component is to detect whether an input is ‘different’ enough from what is normally seen to flag it. If so, it will make the ML system outcome a NULL result for that particular prediction.


Hopefully this gives you an idea of a very interesting area of machine learning! This is an area that isn’t often discussed much, but the consequences can be far-reaching!

Robot Photo by Kindel Media from Pexels