Using data to find information about data
Following on from my prior post about DataOps techniques for using data to manage big data infrastructure, this post's theme is using data to find information about big data.
Data-driven Data Catalog
A data catalog is a detailed inventory of all the data in your data infrastructure. Like an inventory list for a real physical warehouse, if you didn’t have this list, it would be pretty hard to find anything useful.
There’s a saying that if you don’t have a data catalog, your data lake is a data swamp. A poorly documented data lake means you have no idea what’s going in, what’s coming out and what’s festering inside.
Furthermore, as part of the EU GDPR data protection rules, you may also need a data catalog to demonstrate compliance with the law.
As with the above theme, you can use data-driven approaches to manage a data catalog.
Analogy to real life - home contents
In my travels, I find that an organisation’s catalog or stocktake of what data they have is analogous to many people’s home and contents insurance. It is recommended that, as part of having home and contents insurance, you keep a catalog/list of all the items in your house that covers:
- purchase price (if applicable)
- location in house
- receipts or ownership history
Now, when was the last time you knew someone who actually kept such a detailed list? Probably never, and even if they did, it would likely be out of date or incomplete.
However, if you were to ask most people to name, say, the top 10 most important/expensive items in their home and provide the above information, they would likely be able to.
If you apply this analogy to data in organisations:
- Most organisations have a good grasp and information on their most critical data
- However, most organisations would not be able to produce a catalog of all the data they own, or if they could, it would most likely be out of date.
Data catalogs need accessible and searchable metadata
All the metadata you obtain and generate through the data workflow and pipeline should be made available to users in the data catalog. That is, every single piece of data should have metadata that describes:
Data lineage - where the data came from
Data currency - when the data was ingested, processed or last updated
Data quality check and validation results - as described earlier
Data linkages - how it relates to other data (described below)
Data hierarchy - which dataset is the preferred source of truth. Having a hierarchy will prevent scenarios where two different departments provide different numbers to the same metric.
Data security classification - whether it is restricted/sensitive data and subject to privacy laws with limited use (e.g. can’t use it for predictive analytics, only regulatory reporting)
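To make this concrete, here is a minimal sketch of what a catalog entry covering those attributes could look like. The field names and values are purely illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    # Illustrative fields only - not a standard schema
    name: str                 # standardised, descriptive dataset name
    lineage: list             # where the data came from (upstream sources)
    last_updated: str         # data currency (ISO 8601 timestamp)
    quality_checks: dict      # validation results from the pipeline
    linked_datasets: list     # related datasets (data linkages)
    is_source_of_truth: bool  # position in the data hierarchy
    security_class: str       # e.g. "restricted", "internal", "public"

entry = CatalogEntry(
    name="sales.daily_orders",
    lineage=["crm.orders_raw"],
    last_updated="2020-08-01T00:00:00Z",
    quality_checks={"row_count": "pass"},
    linked_datasets=["sales.customers"],
    is_source_of_truth=True,
    security_class="internal",
)
```

In practice these records would be generated automatically by the pipeline, not written by hand.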
Furthermore, having standard naming conventions for business units/departments, data types, data purpose, etc. makes life so much easier and will go a long way in making even the name itself descriptive.
All the above feeds into the most important purpose of a data catalog - being able to search for what data you need. It’s estimated that 60 to 73% of the data an organisation has is unused for analytics.
If you can’t find the data or don’t know what the heck it is, you can’t use it! In a nutshell, you need a search engine inside your data catalog that searches not just the column or table names, but the attributes and metadata too.
There are many data catalog tools out there with this search capability built in out-of-the-box, but I personally find the capabilities of ElasticSearch pretty cool. It’s an open-source analytics engine that is essentially a search engine out-of-the-box!
ElasticSearch is part of the ElasticSearch-Logstash-Kibana (ELK) stack, which includes the data analytics web UI Kibana. Furthermore, it supports multiple ways to query the data, including Python clients, SQL and a REST API.
ElasticSearch is a powerful search engine because it implements two things:
ElasticSearch implements fuzzy matching using Damerau-Levenshtein distance. In essence, the distance between two pieces of text measures how different the words are (e.g. ‘fox’ and ‘box’ are 1 substitution apart, ‘axe’ and ‘aex’ are 1 transposition apart, and ‘face’ and ‘fac’ are 1 insertion apart).
A ranking is then given for how similar two words are based on this distance. By default, the maximum allowed distance depends on the length of the search word - e.g. a max distance of 1 for words between 3 and 5 characters long.
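To make the distance concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. ElasticSearch/Lucene implements this natively and far more efficiently; the function below is just for illustration:

```python
def osa_distance(a: str, b: str) -> int:
    # Optimal string alignment variant of Damerau-Levenshtein:
    # counts insertions, deletions, substitutions and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # Adjacent transposition, e.g. 'axe' vs 'aex'
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(osa_distance("fox", "box"))   # one substitution
print(osa_distance("axe", "aex"))   # one transposition
print(osa_distance("face", "fac"))  # one deletion
```

In an ElasticSearch query you simply set `"fuzziness": "AUTO"` on a match query rather than computing this yourself.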
Similarity is then ranked using a ‘scoring’ model, which for ElasticSearch is Lucene’s Practical Scoring Function. This algorithm is built around the concept of Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF involves calculating how frequently a word shows up in a document (TF) and how rare the word is across all the documents (IDF), giving you a weighting/number per word.
ElasticSearch then uses Okapi BM25, which factors in the TF-IDF weighting to determine the ‘best match’ using a vector- and probability-based algorithm. This is known as a bag-of-words approach, as it treats each word as an individual data point (similar to how you throw all the letter tiles into a bag in Scrabble). Think of it like a sparse matrix of words, where every single word is a ‘column’ in the matrix and every ‘row’ is a document vector.
- “The cat jumped over the lazy fox.”
- “In rare cases cat will fly away.”
- “Cat vs dog.”
Whenever you add data into the ElasticSearch database, it will index and process the ‘terms’ associated with the text. The search is then done on these ‘terms’, not the actual text you put in.
These ‘terms’ are produced by tokenising the original text and removing ‘stopwords’ such as ‘the’ and ‘and’. Furthermore, stemming is applied to normalise word forms (e.g. past vs present tense), and analysers can also normalise number vs word representations of values (e.g. four vs 4). ElasticSearch lets you factor in synonyms as well.
So in the above example, the generated sparse matrix or ‘vectors’ would be:
| Sentence | cat | dog | jumped | over | lazy | fox | rare | cases | fly | away |
|---|---|---|---|---|---|---|---|---|---|---|
| The cat jumped over the lazy fox. | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| In rare cases cat will fly away. | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| Cat vs dog. | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
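A hand-rolled sketch of this indexing step - tokenise, drop stopwords, then build the term-document count matrix. The stopword list here is illustrative only; real analysers use much larger ones:

```python
# Illustrative stopword list only - real analysers use much larger ones
STOPWORDS = {"the", "in", "will", "vs"}

docs = [
    "The cat jumped over the lazy fox.",
    "In rare cases cat will fly away.",
    "Cat vs dog.",
]

def tokenise(text):
    # Lowercase, strip punctuation, drop stopwords
    words = [w.strip(".,").lower() for w in text.split()]
    return [w for w in words if w not in STOPWORDS]

# Vocabulary = every remaining term, sorted; one column per term
terms = sorted({t for d in docs for t in tokenise(d)})
matrix = [[tokenise(d).count(t) for t in terms] for d in docs]

print(terms)
for row in matrix:
    print(row)
```

Each row is the sparse ‘vector’ for one document, which is what the scoring model actually compares against your search terms.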
The above is over-simplified and probably not 100% accurate on how ElasticSearch actually works under the hood, but hopefully gives a bit of flavour as to how powerful NLP algorithms can be.
It’s pretty cool that such a complex NLP algorithm is out-of-the-box in an open-source product! In the field of NLP ML, this type of algorithm is used quite a lot, for example, the TF-IDF vectorizer in Scikit-learn and tokenisation/stemming in NLTK.
If you want more details, you can read Elastic’s official documentation on it.
Use data-driven analytics and machine learning
As I mentioned before, if metadata is treated as data, then really you are using data about data to train ML to identify data related to the data!
Clustering, Correlations and Trends
Since you have generated all the metadata about the data, this data can be used to train a ML model that will automatically identify and tag relevant data.
Data-driven approaches based on data types can also estimate the probability that two datasets are correlated. For example, two datasets could be flagged as ‘related’ if they:
- have the same start and end times with the same number of data points
- were generated at the same time
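As a toy sketch of this idea, assuming each dataset carries start/end metadata (the field names below are hypothetical), you could combine a metadata check with a simple Pearson correlation:

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation coefficient, hand-rolled for the sketch
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def likely_related(meta_a, meta_b, series_a, series_b, threshold=0.8):
    # Flag two datasets as 'related' if their metadata lines up
    # (same window, same number of points) and their values correlate.
    same_window = (meta_a["start"] == meta_b["start"]
                   and meta_a["end"] == meta_b["end"]
                   and len(series_a) == len(series_b))
    return same_window and abs(pearson(series_a, series_b)) >= threshold

a = {"start": "2020-01-01", "end": "2020-01-05"}
b = {"start": "2020-01-01", "end": "2020-01-05"}
print(likely_related(a, b, [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # strongly correlated
```

A real implementation would also handle differing sampling rates, missing points and spurious correlations, but the shape of the check is the same.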
You may notice that identifying these characteristics is similar to the Exploratory Data Analysis (EDA) step in ML/data science. In EDA, there is a focus on identifying trends, correlations and statistical relationships, as well as how the data generally relates to each other.
ML can even monitor usage and access metrics to determine and rank which datasets are the most popular by using a recommendation system (similar to how Netflix recommends movies for you to watch based on people’s viewing patterns).
A great out-of-the-box open-source ML library for this is Surprise (scikit-surprise), which allows you to pick from different types of algorithms, including matrix factorisation approaches popularised by the Netflix Prize.
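Without pulling in a full library, the core ‘users who accessed X also accessed Y’ idea can be sketched in plain Python. All user and dataset names below are made up:

```python
from collections import defaultdict
from itertools import combinations

# Access log: which user queried which dataset (hypothetical names)
access_log = [
    ("alice", "sales.orders"), ("alice", "sales.customers"),
    ("bob", "sales.orders"), ("bob", "sales.customers"),
    ("bob", "hr.payroll"),
    ("carol", "sales.orders"),
]

# Group datasets by the user who accessed them
user_sets = defaultdict(set)
for user, dataset in access_log:
    user_sets[user].add(dataset)

# Count how often two datasets were accessed by the same user
co_access = defaultdict(int)
for datasets in user_sets.values():
    for a, b in combinations(sorted(datasets), 2):
        co_access[(a, b)] += 1

def recommend(dataset):
    # Rank other datasets by how often they co-occur with this one
    scores = defaultdict(int)
    for (a, b), n in co_access.items():
        if a == dataset:
            scores[b] += n
        elif b == dataset:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("sales.orders"))
```

Libraries like Surprise replace this naive co-occurrence counting with proper collaborative filtering models, but the input - usage metrics - is the same.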
Automatically generating tags and descriptions
Natural-language processing machine learning (NLP ML) already exists in many other fields and domains. Some use cases for NLP ML include:
- semantic analysis of keywords to determine the overall ‘theme’ of a dataset
- automatically generating descriptions based on metadata.
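As a toy illustration of the second use case, descriptions can be bootstrapped from catalog metadata with a simple template (all field names here are hypothetical); a language model could generate far richer prose from the same inputs:

```python
def describe(meta):
    # Naive template-based description built from catalog metadata.
    # A language model could produce richer text from the same fields.
    return (f"Dataset '{meta['name']}' sourced from {meta['source']}, "
            f"last updated {meta['updated']}, "
            f"classified as {meta['classification']}.")

meta = {
    "name": "sales.daily_orders",
    "source": "crm.orders_raw",
    "updated": "2020-08-01",
    "classification": "internal",
}
print(describe(meta))
```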
For example, using the recently released NLP ML model GPT-3, created by OpenAI, you can create paragraphs of text from a few keywords. GPT-3 comes pre-trained with over 175 billion parameters, and in tests, human readers struggled to tell some of its generated news articles apart from articles written by real people!
Having a ML library generate natural language descriptions of the datasets would greatly benefit a data catalog, including the search capabilities mentioned above.
Enrich your dataset using external data and data matching
It is common for an organisation to hold the data relating to the same entity/transaction in different source systems, or even incorporate external 3rd party data relating to those entities/transactions.
A data catalog could also connect data from 3rd party sources, such as weather or GPS coordinates, to determine some basic metrics or attributes related to the data.
With all these different systems and sources, it is inevitable that inconsistent identifiers and typos will occur. This makes what should be very useful data difficult to use, as you can’t join up the datasets using the identifiers.
Here’s an example of the same entity but formatted differently in three different systems:
| Dataset | Full Name | Date of Birth | Address |
|---|---|---|---|
| Dataset 1 | John A. Smith | 1/1/90 | Auckland, New Zealand |
| Dataset 2 | SMITH, J A | 1990.1.1 | Auckland, NZ |
| Dataset 3 | John Smith | Jan 1, 1990 | Auckland |
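Before any matching happens, the fields themselves usually need normalising. A small sketch for the date-of-birth column above, trying each known source format in turn:

```python
from datetime import datetime

# The three formats seen in the example table above
DATE_FORMATS = ["%d/%m/%y", "%Y.%m.%d", "%b %d, %Y"]

def normalise_dob(raw):
    # Try each known source format until one parses
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw}")

print(normalise_dob("1/1/90"))
print(normalise_dob("1990.1.1"))
print(normalise_dob("Jan 1, 1990"))
```

All three inputs normalise to the same ISO 8601 date, which makes the records joinable on that field.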
Fortunately, this is where ‘data matching’ algorithms come in, providing a set of techniques that further clean, enrich and identify linkages between datasets.
At a basic-level, there are two main types of data matching - deterministic vs probabilistic.
Deterministic data matching involves comparing whether there’s exact matches between unique identifiers or natural keys (e.g. phone number, address, date of birth). A data matching (confidence) score is then generated based on how many matches there are. For example:
- HIGH confidence - a person’s name, address and date of birth all match
- MEDIUM confidence - a person’s name and address match (there could be two people living in the same house with the same name, like Bob and Bob Jr)
- LOW confidence - a person’s address matches, but their name doesn’t
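These rules translate almost directly into code. A sketch with illustrative field names and tiers:

```python
def deterministic_score(rec_a, rec_b):
    # Exact-match deterministic scoring over natural keys.
    # Field names and confidence tiers are illustrative only.
    name = rec_a["name"] == rec_b["name"]
    address = rec_a["address"] == rec_b["address"]
    dob = rec_a["dob"] == rec_b["dob"]
    if name and address and dob:
        return "HIGH"
    if name and address:
        return "MEDIUM"
    if address:
        return "LOW"
    return "NO MATCH"

a = {"name": "JOHN A SMITH", "address": "1 Queen St", "dob": "1990-01-01"}
b = {"name": "JOHN A SMITH", "address": "1 Queen St", "dob": "1990-01-01"}
c = {"name": "BOB JR", "address": "1 Queen St", "dob": "2010-05-05"}
print(deterministic_score(a, b))  # HIGH
print(deterministic_score(a, c))  # LOW
```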
The downside of deterministic matching is if there’s typos, it will affect the matching score. For example:
- SMITH, JOHN A and JOHN A SMITH are the same person, but the text is not a 100% match
- JOHN A SMITH and JOH N A SMITH are likely the same person, but there’s a typo
Probabilistic data matching, on the other hand, uses ‘fuzzy’ matching, similar to how the Google search engine knows you meant ‘where is my house’ when you accidentally typed ‘where is my h ouse’. It uses statistical approaches to determine a confidence score, such as factoring in how unique the data points are and how likely they are to be a match.
The approach is similar to how the Naive Bayes machine learning algorithm works.
For example, if only 200 people have the last name ‘SUPERAWESOME’ in the data of 10 million people, then the probability of them matching would be much higher than a more common last name like ‘SMITH’.
The weighting of each data point is then aggregated to determine an overall matching score.
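A minimal sketch of that inverse-frequency weighting idea, with made-up population counts: the rarer a value, the more evidence a match on it provides.

```python
from math import log

# Hypothetical surname counts in a 10-million-person dataset
POPULATION = 10_000_000
SURNAME_COUNTS = {"SMITH": 100_000, "SUPERAWESOME": 200}

def surname_weight(surname):
    # Rarer values carry more evidence: weight by inverse frequency,
    # similar in spirit to the IDF part of TF-IDF.
    return log(POPULATION / SURNAME_COUNTS[surname])

# A rare surname match is stronger evidence than a common one
print(surname_weight("SUPERAWESOME") > surname_weight("SMITH"))  # True
```

Summing such weights over all agreeing fields gives the overall probabilistic matching score.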
Once you have a matching score, you can include it in the data catalog. In a perfect world, the score would always be 100%, but the circumstances of the work may permit a lower threshold (e.g. 85%).
Having that flexibility greatly enriches what you can do with that data, as you can link up data that otherwise wouldn’t be ‘related’ using traditional joining methods.
Hopefully that gives you some inspiration or ideas on how you can apply data-driven techniques to find and catalog data.