If we ‘teach’ AI/ML by way of examples, how do we ‘teach’ an AI when we do not have enough examples?
I wanted to touch upon this topic, as it is a very real-world challenge experienced by most ML projects and systems. I too have experienced this issue in my day-to-day ML work.
Unlike in most ML tutorials or online courses, in the real world it is often difficult to actually get reliably labelled training data. Labelled data is data that has been hand-curated so each data point is ‘labelled’ with the correct answer (e.g. in the context of spam filters, every single email has been categorised as ‘spam’ or ‘not spam’).
Realistically, in most cases, only a small amount of the data will have been labelled with proper examples, with the majority being unlabelled. For example, if you wanted to create an image recognition system for your backyard to detect whether a bird has landed, you would need examples of birds and ‘not birds’.
There are public datasets you could obtain, but you would also need specific examples of birds local to your area. Plus you would need to classify/label each picture as bird or not bird. Overall, it’s a lot of work!
So in this blog, I will explore some of the methods for dealing with little or no labelled training data. In particular I will cover:
- What is reliable labelled data?
- Downsides of relying on supervised learning when there’s little labelled data
- Semi-Supervised Learning: Active Learning and Pseudo-labelling
- Generative Models: Generative Adversarial Networks and Variational Autoencoders
What is reliable labelled training data?
Well, in a nutshell, it is the Ground Truth that has been curated and represents the ‘correct’/‘true’ answers.
The term Ground Truth comes from geology/meteorology - you may get weather observations from a satellite or radar station, but you get the Ground Truth (i.e. the actual observation) when you go on-site ‘on the ground’ to take measurements.
For example, if an ML model will be predicting whether a person has cancer, the Ground Truth is a dataset that has explicitly categorised whether each patient was actually diagnosed with cancer or not. The actual diagnosis (i.e. the ‘answer’) is known as the label and a dataset which has these labels is known as a labelled training dataset.
Labelling, ground truthing or annotating is the process of adding labels to a dataset. In the case of the cancer example, this would mean a medical professional has validated by hand that the ‘answers’ correctly reflect the medical diagnosis of each patient.
There are many instances where there just isn’t enough labelled data. There may be organic labels in the raw data that strongly correlate with the ‘correct answer’, but ultimately a human is often required to verify by hand that the labels are the ‘correct answer’.
As expected, this is both a costly and time-consuming process, making Ground Truth and labelled datasets a premium in the world of machine learning, especially datasets which are sufficiently large or contain enough examples of the problem that the ML model is trying to solve.
This labelling process may be crowd-sourced or outsourced to a 3rd party (e.g. Amazon Mechanical Turk) to have humans manually label all the data. However, in many cases, this may not be a viable option, particularly if the data is sensitive/confidential.
Given how quickly data is growing in this world, far more unlabelled data is being created than labelled data (e.g. think of how many YouTube videos are uploaded or how many tweets are posted every second!)
Furthermore, if your training data contains positive feedback loops, then training a model on this data will reinforce the feedback loop. As a general tip, do not use data with feedback loops for ML training.
Feedback loops can ruin a training dataset
Feedback loops occur when the labels in the training data are directly or indirectly influenced by the output of the model that is trained on that data.
A common example of a positive feedback loop, as I mentioned in a previous post, is putting a microphone next to a connected speaker. Sound going into the microphone gets amplified and output by the speaker, which then gets fed back into the microphone, looping until it creates that horrendous ‘screeching’ sound when the volume is too high.
In ML land this means:
- model features are the sound
- ML model output is the microphone
- human action is the speaker.
In a recent real-world example I encountered in my work, we had a system where:
- Heuristics (created by users) were used to determine whether a login was fraudulent, categorising it as ‘low’, ‘medium’ or ‘high’ risk
- This categorisation was used as the label in the training data
- The ML model was trained on this training data to determine whether a login was fraudulent using other factors
- The output of the ML model was used as an input into the heuristics to help users categorise traffic as ‘low’, ‘medium’ or ‘high’ risk
However, the feedback loop reared its ugly head, as the ML training data labels were based on the heuristics that categorised logins, and the heuristics in turn factored in the output of the ML model.
The result was that traffic which had historically been erroneously categorised as ‘high risk’ would always be high risk, as:
heuristics (high risk) -> ML training label (high risk) -> ML model output into heuristics (high risk) -> heuristics (high risk)
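The loop above can be illustrated with a small toy simulation. This is only a sketch, not the system from my work: the heuristic rule, the `alice` account, the 0.5 score cut-off and the ‘model’ (which just remembers the last label it was trained on) are all made up for illustration.

```python
def heuristic_risk(failed_attempts, model_score=None):
    """Hypothetical user-written heuristic for a single login.

    If a model score is supplied, the heuristic factors it in,
    which is what closes the feedback loop.
    """
    if model_score is not None and model_score >= 0.5:
        return "high"
    return "high" if failed_attempts >= 3 else "low"


def train_model(labelled_history):
    """Stand-in for a real classifier: it simply learns the most
    recent label that each account was given."""
    return {account: (1.0 if label == "high" else 0.0)
            for account, label in labelled_history}


def simulate(failed_attempts_per_round, use_model_in_heuristic):
    model = {}
    labels = []
    for failed_attempts in failed_attempts_per_round:
        score = model.get("alice") if use_model_in_heuristic else None
        label = heuristic_risk(failed_attempts, score)
        labels.append(label)
        # The heuristic's output becomes the training label.
        model = train_model([("alice", label)])
    return labels


# One bad day (3 failed attempts), then perfectly normal logins.
attempts = [3, 0, 0, 0, 0]
with_loop = simulate(attempts, use_model_in_heuristic=True)
without_loop = simulate(attempts, use_model_in_heuristic=False)
print(with_loop)     # ['high', 'high', 'high', 'high', 'high']
print(without_loop)  # ['high', 'low', 'low', 'low', 'low']
```

With the model score fed back into the heuristic, a single ‘high risk’ label never gets a chance to correct itself, even though the later logins look perfectly normal.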
I mentioned a similar phenomenon in my previous blog regarding bias in ML - e.g. if an ML model uses successful candidates as the basis to train an HR recruiting model, it will inevitably inherit all the historical bias of past decisions.
Downsides of relying on supervised learning or unsupervised learning when there’s little labelled data
When an AI is trained on proper examples (i.e. ‘labelled data’) it is known as supervised learning.
Supervised learning relies on all of the data being hand-labelled, and its effectiveness is greatly limited by the amount of labelled data available. This makes it both expensive and ineffective in cases where there’s little labelled data.
Unsupervised learning (as covered in my previous blog post) avoids the labelling issue by not requiring any labels at all. However, these algorithms generally focus on finding structure in the data (e.g. clustering) and, on their own, are not effective for creating ML models that predict a specific answer with high accuracy.
Building ML systems is therefore bottlenecked by the cost of high-quality labelled data.
So what do you do when you do not have learning materials (i.e. labelled training data) to train a machine learning (ML) model?
The answer is you need to generate it somehow. Below, I will cover two methods to do so: active learning and pseudo-labelling.
Semi-supervised learning: Active Learning and Pseudo-Labelling
Active learning is a type of semi-supervised machine learning - ‘semi’ in the sense that it uses a small amount of labelled data to help label lots of unlabelled data for the ML model to train on. It is ‘active’ as the learning algorithm will interactively query the user for input to help it label data points.
The idea is that the addition of human input will greatly enhance the efficacy of the ML labelling process, as (supposedly) the human user will have a much better idea of what is the ‘correct’ label than the ML algorithm.
The most common method is known as pool-based sampling - this involves the algorithm selecting, from a large pool of unlabelled data, a subset of examples that it needs the user to label.
These subsets are selected by the algorithm so that they provide the most information to the training dataset (typically the examples the current model is least certain about). How many ‘queries’ it asks the user will depend on the budget limit set. The lower the budget, the fewer questions the algorithm will ask.
Compared to manually labelling thousands or millions of data points, it is a much more effective use of human effort, money and time.
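To make pool-based sampling a little more concrete, here is a minimal sketch in plain Python. It assumes a very simple setup that I have made up for illustration: one-dimensional data, a threshold classifier, and an `oracle` function standing in for the human user (with a hidden rule of ‘positive above 0.62’). The selection rule used is uncertainty sampling, i.e. always asking about the unlabelled point closest to the current decision boundary.

```python
import random


def oracle(x):
    """Stand-in for the human annotator (hypothetical hidden rule)."""
    return int(x > 0.62)


def fit_threshold(labelled):
    """Fit a 1-D threshold classifier: the midpoint between the largest
    known negative and the smallest known positive example."""
    negatives = [x for x, y in labelled if y == 0]
    positives = [x for x, y in labelled if y == 1]
    return (max(negatives) + min(positives)) / 2


def active_learn(pool, labelled, budget):
    """Pool-based active learning with uncertainty sampling."""
    pool = list(pool)
    labelled = list(labelled)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labelled)
        # Query the unlabelled point the model is least sure about:
        # the one closest to the current decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labelled.append((query, oracle(query)))  # ask the 'user'
    return fit_threshold(labelled), labelled


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]   # large unlabelled pool
seed = [(0.0, 0), (1.0, 1)]                  # tiny labelled starting set
threshold, labelled = active_learn(pool, seed, budget=10)
accuracy = sum(int(x > threshold) == oracle(x) for x in pool) / len(pool)
print(f"queries asked: {len(labelled) - len(seed)}, accuracy on pool: {accuracy:.3f}")
```

In this toy setup, ten well-chosen questions are enough to label a pool of 1,000 points almost perfectly, which is the whole appeal of spending the labelling budget where the model is least certain.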
A commonly used open-source Python library that supports active learning with scikit-learn (and others) is modAL (installed as modAL-python).
Naturally, managed cloud ML services, such as AWS SageMaker, support active learning as well.
Similar to active learning, pseudo-labelling is another semi-supervised ML method that uses a small amount of labelled data to add ‘pseudo’ labels to a large amount of unlabelled data:
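- First it trains a model on the labelled data
- Next it uses this model to predict labels on the unlabelled data
- Then it uses the labelled data to determine how ‘accurate’ or ‘confident’ the predictions are, so that the most confident predictions can be kept as ‘pseudo’ labels for further training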
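The steps above can be sketched in a few lines of plain Python. Treat this as a rough illustration only: the nearest-centroid ‘model’, the made-up one-dimensional data, the confidence score and the 0.5 confidence cut-off are all my own assumptions, and real pseudo-labelling is usually done with neural networks.

```python
import random


def fit_centroids(examples):
    """Nearest-centroid 'model': the mean feature value of each class."""
    centroids = {}
    for cls in (0, 1):
        values = [x for x, label in examples if label == cls]
        centroids[cls] = sum(values) / len(values)
    return centroids


def predict(centroids, x):
    """Return (label, confidence), with confidence in [0, 1].

    Confidence is a crude margin: how much closer x is to one centroid
    than to the other, relative to the distance between the centroids.
    """
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    label = 0 if d0 <= d1 else 1
    confidence = abs(d0 - d1) / abs(centroids[1] - centroids[0])
    return label, confidence


rng = random.Random(42)
labelled = [(-0.5, 0), (0.5, 0), (3.5, 1), (4.5, 1)]  # small hand-labelled set
# The true labels below are kept only to evaluate the result at the end.
hidden_truth = ([(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
                + [(rng.gauss(4.0, 1.0), 1) for _ in range(100)])
unlabelled = [x for x, _ in hidden_truth]

# Step 1: train on the labelled data.
model = fit_centroids(labelled)

# Steps 2 and 3: predict on the unlabelled data, keep confident predictions.
pseudo_labelled = []
for x in unlabelled:
    label, confidence = predict(model, x)
    if confidence >= 0.5:
        pseudo_labelled.append((x, label))

# Retrain on the labelled data plus the pseudo-labelled data.
model = fit_centroids(labelled + pseudo_labelled)
accuracy = sum(predict(model, x)[0] == y for x, y in hidden_truth) / len(hidden_truth)
print(f"pseudo-labelled {len(pseudo_labelled)} of {len(unlabelled)} points")
```

The cut-off is the main design choice here: set it too low and wrong pseudo-labels get baked into the training data, set it too high and very little unlabelled data is ever used.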
Pseudo-labelling is quite an emerging field and mainly used in the context of neural networks, so for the purposes of this blog, I won’t dive further into it.
Self-supervised learning is a relatively new field that has an increasing amount of research and interest. One of its biggest advocates is Meta (formerly Facebook).
Natural language processing has been a big focus of self-supervised learning methods (e.g. Meta’s open-source NLP library fastText was created with the help of self-supervised learning).
So what is it?
In short, self-supervised learning is a form of ML algorithm that includes a ‘pre-learning’ step - i.e. in this step, the labels are generated automatically from the data itself, with the ML model training itself to predict selected parts of the dataset from the rest.
Yann LeCun, VP & Chief AI Scientist at Meta (formerly Facebook), provides a great overview of how self-supervised learning works:
- Predict any part of the input from any other part
- Predict the future from the past
- Predict the future from the recent past
- Predict the past from the present
- Predict the top from the bottom
- Predict the occluded from the visible
- Pretend there is a part of the input you don’t know and predict that
In the context of NLP, it would look something like this:
- ML algorithm gets a sentence
- It then masks a word and will try to predict this word
- It then compares the prediction with what the actual word is (which it knows because it masked it)
- It then randomly swaps out a word and predicts which word in the sentence is wrong
- It then compares the prediction with the correct sentence
This iterative process is continued as needed.
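As a small illustration of the masking step, the sketch below turns a raw sentence into (input, label) training pairs without any human labelling. The function name, the `[MASK]` token and the whitespace tokenisation are my own simplifications, and the model that would actually learn from these pairs is left out.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Create one training example per word by hiding that word.

    Each example is (sentence with one word masked, the hidden word).
    The 'label' comes for free from the text itself.
    """
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


examples = masked_examples("the bird landed in the backyard")
for masked_sentence, target in examples:
    print(f"{masked_sentence!r} -> {target!r}")
```

One six-word sentence produces six labelled examples, which hints at why this approach scales so well on large amounts of unlabelled text.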
As a relatively new and emerging field, self-supervised learning is still mostly in the AI research stage and not widely available for most general types of ML problems.
Next, I will discuss generative models and how they too can take a small amount of labelled data to create powerful models.
Generative Models - Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE)
Just a disclaimer - the generative models I’m going to discuss are very complex neural network-based algorithms. The mathematics and intuition behind them are complex and the work is often done by advanced AI/ML researchers. I don’t profess to be anywhere near as good as these researchers, so I will only discuss the intuition behind these algorithms at a high level, while avoiding the detailed mathematics behind them.
Generative models are a type of ML algorithm which generates synthetic data in the process of training. Mathematically, they adopt Bayesian inference, a method of statistical inference based on Bayes’ theorem.
Without going into too much mathematical detail, the underpinning concept of this theorem is Bayes’ rule: the probability of an event is determined by prior knowledge of the event and the likelihood of the event happening. As more information becomes available (i.e. more facts are known), the probability is updated and fine-tuned.
Importantly you have two things: the prior, which is what you believe the probability is before applying Bayes’ rule, and the posterior, which is the probability of the event occurring after applying Bayes’ rule.
Bayes’ theorem is powerful because of its flexible application - you can factor in many known facts/probabilities into a model to fine-tune results. It focuses on conditional probability: the probability of an event happening given a particular fact.
For example, it is winter and the chance of rain on any given winter day is 10%. However, you know that in the last 5 days it rained once. So what is the probability it will rain today, given the knowledge that it rained once (1) in the last 5 days?
In the above example, you would use the 10% chance as the prior and the observation of 1 rainy day in 5 as the new evidence to calculate the final posterior conditional probability.
That is: Posterior (conditional probability) = (Likelihood x Prior) / Scaling Factor.
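One way to make the rain example concrete is a Beta-Binomial update, sketched below. The original example does not say how strongly the 10% figure should be trusted, so I am assuming a Beta(1, 9) prior, which has a mean of 10% and carries roughly the weight of 10 past days of observations. A different prior strength would give a different posterior.

```python
from fractions import Fraction


def update(prior_alpha, prior_beta, rainy_days, dry_days):
    """Bayes' rule for a Beta prior with rainy/dry day observations:
    the posterior is again a Beta, with the observed counts added on."""
    return prior_alpha + rainy_days, prior_beta + dry_days


def mean(alpha, beta):
    """Expected probability of rain under a Beta(alpha, beta) belief."""
    return Fraction(alpha, alpha + beta)


prior = (1, 9)                                   # assumed prior: mean 1/10
posterior = update(*prior, rainy_days=1, dry_days=4)
prior_mean = mean(*prior)
posterior_mean = mean(*posterior)
print(prior_mean, "->", posterior_mean)          # 1/10 -> 2/15
```

Under these assumptions the estimate moves from 10% to about 13.3%: the evidence of 1 rainy day in 5 (20%) pulls the belief upwards, but the prior stops it from jumping all the way there.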
There is an excellent and advanced article on Medium that goes through how Bayes’ theorem applies to generative models in more detail (including the mathematics behind it).
But in short, there are two types of generative models I will discuss now: generative adversarial networks (GAN) and variational auto-encoders (VAE). I’ll start with GANs first.
Generative Adversarial Networks (GANs)
Turing Learning and Adversarial Networks
The other day, I was thinking of an interesting question related to the Turing Test. It is a test devised by Alan Turing (the man often credited with inventing the concepts behind modern computer algorithms). The test involves a human and an AI - the human (who doesn’t know whether they are interacting with a human or an AI) will interact with the AI, and if the human believes it is a human, then the AI is considered to have ‘passed’ the Turing Test.
The test itself is somewhat contested on its efficacy and whether it can actually do what it is supposed to do; however, it made me ask another interesting question.
Is it possible to create an AI that can administer the Turing Test?
That is, create an AI that will judge whether it is interacting with another AI or human.
The answer is: you train two models! One will be trying its best to fool the Turing Test and the other will try its best to administer the Turing Test.
This form of learning is known as Turing Learning:
- A discriminator’s role is to try to determine real vs fake
- A generator’s role is to try and generate data that can fool the discriminator
As these two models are ‘pitted’ against each other, it is known as an adversarial network.
For example, in the context of spam detection: one ML model would be trying to create the best spam possible and the other ML model would be trying to create the best spam detection model possible.
Generators and Discriminators
A small set of labelled data is used to train the GAN. However, an interesting point is how the labelled data is used:
- Discriminators only use labelled data as positive examples during training. Output from the generator is used as negative examples during training.
- Generators only use random inputs to generate fake data. They also use feedback from the discriminator as the ‘labelled data’ to improve themselves during training.
GANs incorporate both supervised and unsupervised machine learning:
- The ‘supervised’ part is where you provide positive training examples to the discriminator. It is ‘supervised’ as a human has curated and labelled this training data as the ‘correct’ answers.
- The ‘unsupervised’ part is where the generator generates negative training examples. It is ‘unsupervised’ in the sense that no human has explicitly labelled these as ‘incorrect’ answers.
Furthermore, the generator and discriminator do not train at the same time - when the discriminator is training, the generator’s algorithm remains constant as it produces examples for the discriminator to train on, and vice versa.
This adversarial process is repeated iteratively - the generator gets better at generating fakes and the discriminator gets better at detecting fakes.
The ideal scenario is GAN convergence - the generator becomes so good at generating fake data that the discriminator can’t tell the difference between real and fake. That is, the discriminator has 50% accuracy (no different to a coin flip).
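The alternating training described above can be sketched with a deliberately tiny GAN in plain Python. Everything here is a simplifying assumption on my part: the ‘real’ data is just numbers drawn from a bell curve centred on 4, the generator is a straight line (`a * z + b`), the discriminator is a single logistic unit, and the gradients are worked out by hand rather than with a framework like TensorFlow. Toy GANs like this one tend to oscillate around the target rather than settle neatly, so treat it as an illustration of the training loop, not of convergence.

```python
import math
import random


def sigmoid(s):
    s = max(-30.0, min(30.0, s))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-s))


def discriminator_step(d, g, real_batch, noise_batch, lr):
    """Update only the discriminator (w, c); the generator is held constant."""
    w, c = d
    a, b = g
    grad_w = grad_c = 0.0
    for x in real_batch:                  # real examples: target is 1
        err = sigmoid(w * x + c) - 1.0
        grad_w += err * x
        grad_c += err
    for z in noise_batch:                 # generated examples: target is 0
        x = a * z + b
        err = sigmoid(w * x + c)
        grad_w += err * x
        grad_c += err
    n = len(real_batch) + len(noise_batch)
    return (w - lr * grad_w / n, c - lr * grad_c / n)


def generator_step(d, g, noise_batch, lr):
    """Update only the generator (a, b); the discriminator is held constant."""
    w, c = d
    a, b = g
    grad_a = grad_b = 0.0
    for z in noise_batch:
        # The generator wants the discriminator to output 1 for its fakes.
        err = sigmoid(w * (a * z + b) + c) - 1.0
        grad_a += err * w * z
        grad_b += err * w
    n = len(noise_batch)
    return (a - lr * grad_a / n, b - lr * grad_b / n)


rng = random.Random(0)
d = (0.0, 0.0)   # discriminator: D(x) = sigmoid(w * x + c)
g = (1.0, 0.0)   # generator: G(z) = a * z + b, with z random noise
for step in range(200):
    real = [rng.gauss(4.0, 1.0) for _ in range(32)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(32)]
    d = discriminator_step(d, g, real, noise, lr=0.05)
    noise = [rng.gauss(0.0, 1.0) for _ in range(32)]
    g = generator_step(d, g, noise, lr=0.05)
```

Note how each step function only returns new parameters for one of the two models: that is the ‘do not train at the same time’ rule in code form.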
This form of adversarial interaction is based on Game Theory, specifically a zero-sum game. In this form of game, participants (i.e. players) compete for a fixed and finite amount of resources. When a player gets more resources, it is at the expense of another player.
To survive, players ultimately apply minimax techniques - ways to minimise the maximum loss from the opposing player’s moves while maximising the gains from the player’s own moves.
In the context of GANs, it is zero-sum, because if the generator performs well, the discriminator gets worse and vice versa.
Provided both players survive, eventually this adversarial interaction will result in a Nash Equilibrium: both players have reached their peak ability given the other player’s ability to thwart them.
Some excellent open-source Python libraries provide a good way to train and make GANs - e.g. TF-GAN, which runs on top of TensorFlow.
Variational Autoencoders (VAEs)
VAEs use the concept of an autoencoder, an ML model which uses two neural networks: an encoder and a decoder.
Encoders focus on compressing the input, while decoders try to reconstruct the input from the compressed form. The compressed form is known as latent space, which means the data is represented in a multi-dimensional way so that similar data is ‘closer together’ in space.
That is, VAEs will encode a given input into latent space and then decode (i.e. reconstruct) the input from the latent space. The concept is similar to dimensionality reduction (which I discussed in a previous blog post) and the aim is to keep the maximum amount of information when encoding, while trying to be as ‘general’ as possible (i.e. regularised).
The ‘variational’ part of VAE ensures that the latent space is as regular as possible - i.e. it needs to be as general as possible, while retaining the maximum amount of information possible. This ensures only the key information is retained in the latent space.
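This is generally done with sampling and distributions: the encoder outputs a probability distribution over the latent space rather than a single point, and the decoder reconstructs the input from a sample drawn from that distribution.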
This process is done iteratively:
- Decoders act as the generator (creating new data points from the latent space)
- The decoder model results are compared to the original input
- Both the encoder and decoder models are updated based on the results to improve them
Unlike GANs, VAEs are not ‘adversarial’ in nature and the encoder and decoder work hand-in-hand to get the best possible reconstruction.
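To give a feel for the ‘variational’ part described above, here is a small sketch of just two pieces of a VAE in plain Python: drawing a point from the latent distribution, and the penalty that keeps that distribution ‘regular’. It is not a full VAE (there are no trained encoder or decoder networks and no reconstruction loss), and the example `mean` and `log_var` values are made up to stand in for an encoder’s output for one input.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps, with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL divergence between N(mean, diag(exp(log_var))) and N(0, I).

    This is the regularisation term: it is zero only when the encoder's
    distribution is exactly a standard normal.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


rng = random.Random(0)
mean, log_var = [0.8, -1.2], [-2.0, -1.0]   # hypothetical encoder output
z = sample_latent(mean, log_var, rng)        # what the decoder would receive
penalty = kl_to_standard_normal(mean, log_var)
print(z, penalty)
```

In a real VAE this penalty is added to the reconstruction error, so the encoder is pushed to keep useful information while staying close to a simple, general-purpose distribution.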
Hopefully this gives an insight into a topic I personally find interesting.
After all, we can’t be clicking on images of cars vs bicycles forever to help label self-driving car training datasets…