Testing in Machine Learning

Anisha Narang
4 min read · Nov 8, 2021


Testing in machine learning is quite different from the way we test software traditionally (functional tests, regression tests, etc.), where we check the actual vs. expected behavior of a given application. In the ML world, data sets with desired behavior are used to train the model (i.e., to learn the logic), and we need to check whether the trained model consistently provides the expected output.

Phases in an ML project and the corresponding stages of the data pipeline:

Source: https://greatexpectations.io/blog/ml-ops-data-quality/

To proceed step by step, we look at the validation stages in the diagram above; that is where tests fit into the ML pipeline. The data ingestion stage ingests the raw data set, and the data cleaning stage performs the preprocessing and normalization required for the training samples, such as removing missing values and duplicates. The output of this stage is a cleaned data set.
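As a rough sketch of what the cleaning stage might do, pandas makes these steps one-liners. The column names and records below are hypothetical, just for illustration:

```python
import pandas as pd

# Hypothetical raw records; a real pipeline would read these from disk.
raw = pd.DataFrame({
    "serial_number": ["A1", "A2", "A2", "A3"],
    "description": ["ok", None, "dup", "fine"],
})

# Drop rows with missing values, then drop duplicates on the key column.
cleaned = (
    raw.dropna(subset=["description"])
       .drop_duplicates(subset=["serial_number"])
)
print(len(cleaned))  # rows remaining after cleaning
```

In a real pipeline, the cleaned frame would then be written out (e.g., back to JSONL) as the input for the next stage.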

There are two kinds of tests that you will need to focus on:

  • Pre-train tests: These include assertions about the data set, essentially data validation after the data ingestion and data cleaning stages. Now, why do we need those? The cliché "garbage in, garbage out" is popular in machine learning because the quality of the data used for training determines the quality of the predictions. That is why data validation in machine learning is imperative, not optional!
  • Post-train tests: These come into the picture after the model has been trained on the validated data set. As a starting step, you would want to cover the minimum functionality tests and determine the key metrics: F1 score, recall, and precision (it is important to have a deep understanding of what each of them means and how to compute it).
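As a quick refresher on how those metrics are computed, here is a minimal sketch in plain Python; the confusion-matrix counts are made up for illustration:

```python
# Toy confusion-matrix counts for a binary classifier (illustrative numbers).
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Libraries like scikit-learn compute these for you from predictions and labels, but knowing the formulas helps you interpret the numbers.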

Getting started with Pre-train tests

  • If this is your first time working with data, this might also be the first time you are looking at such a large data set, maybe 100k lines (or more) of JSON records. To read and understand the data set, you might want to familiarize yourself with Jupyter notebooks and the pandas library.
# To view the test data, columns and values of records in a Jupyter
# notebook and see the output of each command:
import pandas as pd
test_df = pd.read_json('path/to/raw_data.jsonl', lines=True)
test_df.head()
  • You can refer to the pandas cheat sheet here.
  • Now that you are comfortable using the pandas library, you may want to move on to writing some data validation tests. The Great Expectations library helps you write pre-train tests with data validations.

From the Great Expectations documentation, here is what it says:

With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly — Expectations are basically unit tests for your data.

In brief, the kinds of data validations you might want to add:

- expect_column_to_exist
- expect_column_values_to_be_unique
- expect_column_values_to_not_be_null
- expect_column_values_to_match_regex
- You can refer to the full list of expectations here.
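If you want to prototype before adopting Great Expectations, roughly equivalent checks can be written in plain pandas. A sketch of the four expectations above, with hypothetical column names:

```python
import pandas as pd

# Hypothetical cleaned records for illustration.
df = pd.DataFrame({
    "serial_number": ["A1", "A2", "A3"],
    "product": ["x", "y", "z"],
})

# expect_column_to_exist
assert "serial_number" in df.columns
# expect_column_values_to_be_unique
assert df["serial_number"].is_unique
# expect_column_values_to_not_be_null
assert df["product"].notna().all()
# expect_column_values_to_match_regex
assert df["serial_number"].str.match(r"A\d+$").all()
```

Great Expectations gives you the same checks with richer result reporting, which is what the script below uses.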

A basic script with minimal validations using the Great Expectations library might look like this:

# validations_get_dataset.py
from pathlib import Path

import pandas as pd
import great_expectations as ge

DATA = '/path/to/test_data'
assert Path(DATA).exists()

raw_data = pd.read_json(f'{DATA}/raw/raw_data.jsonl', lines=True)
print('\n-- Columns from raw data: --\n', raw_data.columns)

ge_raw = ge.from_pandas(raw_data)
expected_columns = [
    'description',
    'serial_number',
    'product',
]
result = ge_raw.expect_table_columns_to_match_set(
    expected_columns,
    exact_match=True,
    result_format={'result_format': 'BASIC'},
    include_config=True,
    catch_exceptions=True,
)

assert result['success']

To add the pre-train tests to your pipeline, you first need to understand the existing ML pipeline. In my case, I was introduced to dvc.yaml, which specifies all the stages of an ML pipeline in YAML format.

In short, dvc.yaml is a file that contains all the individual stages of a machine learning pipeline, from data ingestion to training to evaluation. For a better understanding, please read more about DVC and dvc.yaml.

How to add data validation stages to the existing dvc.yaml?

# dvc.yaml
# first stage: data ingestion, second stage: data validation
stages:
  get_dataset:
    cmd: wget /path/to/raw_data.jsonl -P ./data/raw
    outs:
      - ./data/raw/raw_data.jsonl
  validate_get_dataset:
    cmd: python ./data_validations/validations_get_dataset.py
    deps:
      - ./data/raw/raw_data.jsonl
      - ./data_validations/validations_get_dataset.py
  <third stage and so on>

Woah! You just finished adding pre-train tests to your ML pipeline. The next time your pipeline runs to retrain the model, your data validation checks will be in place to make sure the training data is as expected.

I will add more information about post-train tests in the next post. This was my first technical post; I usually write about life experiences in general and share travel stories on YouTube.

How do you test the data/ML pipeline in your project? Leave a response in the comments, or leave a clap! :)

Special thanks to Manikandan Sivanesan for guidance along the way.

Articles that helped me have an understanding of the bigger picture:
- https://www.jeremyjordan.me/testing-ml/
- https://eugeneyan.com/writing/testing-ml/

I continue to learn…

