Testing in Machine Learning [Part 2]

Anisha Narang
Nov 25, 2021


This post is a continuation of the previous one, Testing in Machine Learning, where I covered the basics of testing in ML, what you need to learn in order to get started, and guided code snippets for adding pre-train tests to your pipeline.

** The overview of an ML pipeline and pre-train tests was covered in the previous post.

Getting started with post-train tests:

Now that you have the pre-train tests as a part of your pipeline, it’s time to dive into the post-train tests.

Post-train tests: These come into the picture after the model has been trained, when we need to check the consistency of the model’s predictions. There are three types of post-train tests:

1. Minimum Functionality Tests: Identify scenarios for different sub-populations in the data set to check for regressions. (In scope for this post)

2. Invariance tests: To check for consistency in the model predictions under small perturbations of the data set. Let’s say we swap in a synonym or change the subject; the prediction should not change (see the sketch after this list).

3. Directional Expectation Tests: These allow us to define a set of perturbations to the input which should have a predictable effect on the model output (let’s say increasing a number in one of the fields should change the prediction in a known direction).

You can find more about the above tests here (Source: https://www.jeremyjordan.me/testing-ml/).
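To make the invariance idea concrete, here is a minimal pytest-style sketch. The predict_product helper below is a toy stand-in for a real model, and the record fields simply mirror the ones used later in this post; a directional expectation test would look similar, except that the assertion checks that the output moves in the expected direction.

# Toy stand-in for the real model; in practice this would call your trained
# model or prediction endpoint.
def predict_product(record: dict) -> str:
    return 'Product A' if 'laptop' in record['product_description'] else 'Product B'

def test_prediction_is_invariant_to_synonyms():
    original = {'product_description': 'laptop bag with a zipper', 'use_case': 'travel'}
    perturbed = {'product_description': 'laptop bag with a zip fastener', 'use_case': 'travel'}
    # Replacing a word with a synonym should not change the predicted label.
    assert predict_product(original) == predict_product(perturbed)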

How to get started with Minimum Functionality Tests?

In order to check for consistency, we need to get familiar with the confusion matrix and the classification report (recall, precision and F1 score).

What is a confusion matrix?

  • The confusion matrix is an N x N table (where N is the number of classes) that contains the number of correct and incorrect predictions made by the classification model.

A confusion matrix looks like the one illustrated here: https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/

In short, what each one of them means:

  • True Positive (TP):
    The model predicted positive, and the real value is positive.
  • True Negative (TN):
    The model predicted negative, and the real value is negative.
  • False Positive (FP):
    The model predicted positive, but the real value is negative (Type I error).
  • False Negative (FN):
    The model predicted negative, but the real value is positive (Type II error).
    Source: https://medium.com/swlh/confusion-matrix-and-classification-report-88105288d48f
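
As a quick illustration (with made-up toy labels), scikit-learn can compute the confusion matrix directly from the true and predicted labels:

from sklearn.metrics import confusion_matrix

# Toy labels for a binary classifier: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_true, y_pred))
# [[3 1]   -> TN, FP
#  [1 3]]  -> FN, TP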

How do you calculate Precision, Recall and F1 Score?

You can refer here for a better understanding: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
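
In short: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F1 score is the harmonic mean of the two, 2 * precision * recall / (precision + recall). scikit-learn computes these for you; here is a quick check using the same toy labels as in the confusion matrix example above:

from sklearn.metrics import precision_score, recall_score, f1_score

# Same toy labels as in the confusion matrix example above
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))         # 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75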

Once you have a decent understanding of each of these metrics, you might want to compute them for your data. In machine learning, we usually split our data set into two subsets, training data and testing data, which we call a test/train split.

Test Train Split:

The large data set (let’s consider 100k records, like I mentioned in my previous post) would be split into two parts:

  • Train data set: Used to train the ML model
  • Test data set: Used to evaluate the ML model.

For evaluation purposes, we need data that was not used to train the model. The most common split is Train: 80%, Test: 20%, but you can choose the ratio based on your data set and requirements.

How to test/train split your data, and things to consider:

import numpy as np 
from sklearn.model_selection import train_test_split
import pandas as pd
test_df = pd.read_json('path/to/raw_data.jsonl', lines=True)
test_df['product'].value_counts()
>>>
Product A 11000
Product B 15000
Product C 2000
Product D 3000

If you observe the above output, you’ll see that our data set is not balanced: we have more records for Products A and B than for Products C and D. In this case, we might want to stratify the split so that each class label keeps the same proportion in the train and test sets and no class is left under-represented in the test set. Stratification is especially useful for assessing performance on critical sub-populations where we never want to see a regression.

# Let's say Products A and B are our top-selling products and we want to analyze
# the accuracy of the model against them separately, and then against the other categories.
top_products = ['Product A', 'Product B']
other_products = ['Product C', 'Product D']
df_top_products = test_df[test_df['product'].isin(top_products)]
# Split the top-product records, stratifying on the product column so both splits keep the same class proportions.
train, test = train_test_split(df_top_products, test_size=0.25,
                               stratify=df_top_products['product'], random_state=42)

You can read more about `stratification` and `random_state` here and decide your strategy for the test data.

Next up, we want the classification report, for which you might want to read up here. Now you need the data along with the predictions. In this case, we assume that the model is defined and predictions are served from an endpoint. We can speed up getting predictions by fetching them in parallel using the multiprocessing package in Python.

from typing import Any, Dict, List
from multiprocessing import Pool
from sklearn.metrics import classification_report, f1_score

def get_response(obj: pd.Series) -> List[Dict[str, Any]]:
    # Build the request payload for a single record.
    post_data = {
        "serial_number": obj.serial_number,
        "product_description": obj.product_description,
        "use_case": obj.use_case
    }
    return [post_data]

products = [get_response(row) for _, row in test.iterrows()]

def predicted_product(product):
    # Define your function to fetch the prediction from the model endpoint.
    ...

# Parallel processing: fetch the predictions concurrently
num_clients = 10
pool = Pool(num_clients)
print(pool)
test_ensemble = pool.map(predicted_product, products)

score_top_products = classification_report(test.product.values, test_ensemble,
                                            labels=top_products, output_dict=True)

*The above is a sample block of code for reference.

score_top_products
# These are sample outputs
              precision    recall  f1-score   support

   Product A       0.89      0.86      0.88      4915
   Product B       0.89      0.90      0.89      1428

   micro avg       0.87      0.86      0.86     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.86      0.86     10000

Woah! You now have the classification report for your trained model, and currently we are looking at how the model behaves for our top products. Similarly, you can decide the labels and measure the overall accuracy of the model, or its accuracy against certain categories.

Every time you train the model, you can make sure these numbers do not drop. These tests can be included as part of the ML pipeline in the same way as the pre-train tests.
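
For example, a minimal pytest-style minimum functionality test could assert that the per-class F1 scores never fall below an agreed baseline; the baseline values below are illustrative assumptions, and score_top_products is the output_dict=True report computed in the snippet above:

# Baselines are illustrative assumptions; tune them to your own model and data.
F1_BASELINES = {'Product A': 0.85, 'Product B': 0.85}

def test_top_products_f1_does_not_regress():
    # score_top_products is the output_dict=True classification report from above.
    for label, baseline in F1_BASELINES.items():
        assert score_top_products[label]['f1-score'] >= baseline, (
            f"F1 score for {label} dropped below the baseline of {baseline}"
        )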

For a more visual representation of the classification report, you can use the matplotlib library. There is a lot more you can do with the tool Evidently: it helps you monitor and evaluate machine learning models, analyze the performance of a classification model, see data drift, etc.
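
As a simple sketch (again assuming score_top_products is the output_dict=True report computed earlier), you could plot the per-class F1 scores as a bar chart:

import matplotlib.pyplot as plt

# Assuming score_top_products is the output_dict=True report computed earlier
labels = ['Product A', 'Product B']
f1_scores = [score_top_products[label]['f1-score'] for label in labels]

plt.bar(labels, f1_scores)
plt.ylim(0, 1)
plt.ylabel('F1 score')
plt.title('Per-class F1 score for top products')
plt.show()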

You can read more about determining the performance of the model on different data sets here, categorized as Performance on a Held-Out Dataset, Performance on Specific Examples, and Performance on Critical Subpopulations.

How do you write post-train tests for your machine learning models? Please respond in the comments section below; happy to learn!

Special thanks to Manikandan Sivanesan for all the guidance and sharing reusable code snippets.
