Validation Methods

Validation methods check the robustness and accuracy of a model and help diagnose whether the model is overfitting or underfitting.

Stratified K-Fold Cross-Validation

A variation of k-fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. For example, for a data set of 100 samples with 40 samples from class 1 and 60 samples from class 2, a stratified 2-fold scheme produces folds of 50 samples each, with 20 samples from class 1 and 30 samples from class 2.

Parameters

number_of_folds (int) – the number of stratified folds to produce
test_size (float) – the percentage of data to hold out as a final test set
shuffle (bool) – Specifies whether or not to shuffle the data before performing the cross-fold validation splits.
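
The 100-sample example above can be reproduced with a small, self-contained sketch. This is an illustration of the stratification idea only, not the Piccolo AI implementation: the function name `stratified_k_fold` and its `seed` argument are hypothetical, and the `test_size` hold-out is left out for brevity.

```python
from collections import defaultdict
import random


def stratified_k_fold(labels, number_of_folds, shuffle=False, seed=0):
    """Return a list of folds; each fold is a list of sample indices."""
    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    rng = random.Random(seed)  # hypothetical seed, for repeatable shuffles
    folds = [[] for _ in range(number_of_folds)]
    for class_indices in indices_by_class.values():
        if shuffle:
            rng.shuffle(class_indices)
        # Deal each class out round-robin so every fold gets an equal share.
        for position, index in enumerate(class_indices):
            folds[position % number_of_folds].append(index)
    return folds


# The example from the text: 40 samples of class 1 and 60 of class 2.
labels = [1] * 40 + [2] * 60
folds = stratified_k_fold(labels, number_of_folds=2, shuffle=True)
fold_counts = [
    {cls: sum(labels[i] == cls for i in fold) for cls in (1, 2)}
    for fold in folds
]
print(fold_counts)  # [{1: 20, 2: 30}, {1: 20, 2: 30}]
```

In each round of cross-validation one fold serves as the validation set and the remaining folds form the training set.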

Leave-One-Subject-Out

A cross-validation scheme that holds out all samples from one subject for testing in each fold and trains on the remaining subjects. For example, for a data set of 10 subjects, each fold consists of a training set from 9 subjects and a test set from 1 subject; in all, there are 10 folds, one for each left-out test subject.

Parameters

group_columns (list[str]) – list of column names that define the groups (subjects)
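
As a rough sketch of the scheme, assuming a single grouping key per sample, the folds can be generated as follows. The function name `leave_one_subject_out` and the subject labels are hypothetical; with several `group_columns`, the key would presumably be the tuple of their values.

```python
def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_indices, test_indices) per fold."""
    for held_out in sorted(set(subjects)):
        # Everything from the held-out subject is test data ...
        test = [i for i, s in enumerate(subjects) if s == held_out]
        # ... and everything from the other subjects is training data.
        train = [i for i, s in enumerate(subjects) if s != held_out]
        yield held_out, train, test


# Hypothetical data: 10 subjects with 3 samples each.
subjects = [f"subject_{n}" for n in range(10) for _ in range(3)]
splits = list(leave_one_subject_out(subjects))
print(len(splits))  # 10, one fold per subject
```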

Stratified Metadata k-fold

K-fold iterator variant with non-overlapping metadata/group and label combinations, which also attempts to distribute each class evenly across the folds. This is similar to GroupKFold, where the same group cannot appear in multiple folds, but in this case it is the same group and label combination that cannot appear in multiple folds.

The main use case is time series data with a Subject group, where each subject performs several activities. If you build a model using a sliding window to segment data, you will end up with many segments of “Subject A” performing “action 1”. A validation method that splits the segments of “Subject A” performing “action 1” across different folds can often cause data leakage and overfitting. If, however, you build your validation set so that “Subject A” performing “action 1” appears in only a single fold, you can be more confident that your model is generalizing. This validation method also attempts to keep the number of “action 1” samples similar across your folds.

Parameters

number_of_folds (int) – the number of stratified folds to produce
metadata_name (str) – the metadata to group on for splitting data into folds.
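
One way to picture the behavior described above is the greedy sketch below. It is an assumption-laden illustration, not the Piccolo AI algorithm: the function name `stratified_metadata_k_fold` and the balancing heuristic (largest combinations first, each placed in the fold with the fewest samples of that label) are choices made here for clarity.

```python
from collections import defaultdict


def stratified_metadata_k_fold(groups, labels, number_of_folds):
    """Assign each (group, label) combination to exactly one fold."""
    # Collect the sample indices for every (group, label) combination.
    combos = defaultdict(list)
    for index, key in enumerate(zip(groups, labels)):
        combos[key].append(index)

    folds = [[] for _ in range(number_of_folds)]
    # Per-fold count of samples for each label, used for balancing.
    label_counts = [defaultdict(int) for _ in range(number_of_folds)]

    # Place the largest combinations first; each goes to the fold that
    # currently holds the fewest samples of that combination's label.
    ordered = sorted(combos.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for (_, label), indices in ordered:
        target = min(
            range(number_of_folds), key=lambda f: label_counts[f][label]
        )
        folds[target].extend(indices)
        label_counts[target][label] += len(indices)
    return folds


# Hypothetical data: 4 subjects, each with 5 windows of 2 actions.
actions = ("action 1", "action 2")
groups = [s for s in "ABCD" for _ in range(10)]
labels = [a for _ in "ABCD" for a in actions for _ in range(5)]
folds = stratified_metadata_k_fold(groups, labels, number_of_folds=2)
per_fold = [
    {a: sum(labels[i] == a for i in fold) for a in actions} for fold in folds
]
print(per_fold)  # 10 samples of each action in each fold
```

Because a whole combination moves as a unit, windows cut from the same subject and activity never end up on both sides of a split.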

Metadata k-fold

K-fold iterator variant with non-overlapping metadata groups. The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds). The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

Parameters

number_of_folds (int) – the number of folds to produce
metadata_name (str) – the metadata to group on for splitting data into folds.
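
A minimal sketch of the idea, assuming one metadata value per sample: distinct groups are dealt out round-robin so that each fold receives roughly the same number of groups. The function name `metadata_k_fold` and the round-robin rule are illustrative assumptions, not the library's implementation.

```python
def metadata_k_fold(groups, number_of_folds):
    """Return folds of sample indices; no group spans two folds."""
    distinct = sorted(set(groups))
    if len(distinct) < number_of_folds:
        raise ValueError("need at least as many distinct groups as folds")

    # Deal the distinct groups out round-robin across the folds.
    fold_of_group = {g: n % number_of_folds for n, g in enumerate(distinct)}
    folds = [[] for _ in range(number_of_folds)]
    for index, group in enumerate(groups):
        folds[fold_of_group[group]].append(index)
    return folds


# Hypothetical data: 7 groups (g0..g6) with 1 to 7 samples each.
groups = [f"g{n}" for n in range(7) for _ in range(n + 1)]
folds = metadata_k_fold(groups, number_of_folds=3)
groups_per_fold = [len({groups[i] for i in fold}) for fold in folds]
print(groups_per_fold)  # [3, 2, 2]
```

Note that the balance is in the number of groups, not the number of samples, so folds can differ in size when groups do.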

Recall

The simplest validation method, wherein the training set itself is used as the test set. In other words, for a data set consisting of 100 samples in total, both the training set and the test set consist of the same set of 100 samples.

Set Sample Validation

A validation scheme wherein the data set is divided into training and test sets based on two statistical parameters, mean and standard deviation. The user selects the number of events in each category and can optionally specify the subset mean, the standard deviation, the number of samples in the validation set, and the maximum number of retries of random selection from the original data set.

Example

samples = {"Class 1": 2500, "Class 2": 2500}
validation = {"Class 1": 2000, "Class 2": 2000}

client.pipeline.set_validation_method({"name": "Set Sample Validation",
                                       "inputs": {"samples_per_class": samples,
                                                  "validation_samples_per_class": validation}})

Parameters

data_set_mean (numpy.array[floats]) – mean value of each feature in dataset
data_set_stdev (numpy.array[floats]) – standard deviation of each feature in dataset
samples_per_class (dict) – Number of members in subset for training, validation, and testing
validation_samples_per_class (dict) – Overrides the number of members in subset for validation if not empty
mean_limit (numpy.array[floats]) – minimum acceptable difference between mean of subset and data for any feature
stdev_limit (numpy.array[floats]) – minimum acceptable difference between standard deviation of subset and data for any feature
retries (int) – Number of attempts to find a subset with similar statistics
norm (list[str]) – ['Lsup', 'L1'] Distance norm for determining whether subset is within user defined limits
optimize_mean_std (list[str]) – ['both', 'mean'] Logic to use for optimizing subset. If 'mean', then only mean distance must be improved. If 'both', then both mean and stdev must improve.
binary_class1 (str) – Category name that will be the working class in set composition
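
To make the roles of `mean_limit`, `stdev_limit`, `retries`, and the 'Lsup' norm more concrete, here is a loose sketch of a retry-based subset search. Everything about it is an assumption made for illustration: the function names, the scalar limits (the parameters above are per-feature arrays), and the rule for keeping the closest subset seen so far. It does not reproduce the Piccolo AI implementation.

```python
import random
from statistics import mean, stdev


def column_stats(rows):
    """Per-feature mean and standard deviation of a list of feature rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def lsup(a, b):
    """Sup-norm distance: the largest absolute per-feature difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def sample_subset(rows, size, mean_limit, stdev_limit, retries, seed=0):
    """Return (subset, attempts_used).

    The subset is the first one within both limits, or the closest one
    seen if no attempt satisfies them.
    """
    rng = random.Random(seed)
    data_mean, data_stdev = column_stats(rows)
    best, best_score = None, float("inf")
    for attempt in range(1, retries + 1):
        subset = rng.sample(rows, size)
        sub_mean, sub_stdev = column_stats(subset)
        mean_gap = lsup(sub_mean, data_mean)
        stdev_gap = lsup(sub_stdev, data_stdev)
        if mean_gap <= mean_limit and stdev_gap <= stdev_limit:
            return subset, attempt
        # Hypothetical tie-breaking score: sum of the two distances.
        if mean_gap + stdev_gap < best_score:
            best, best_score = subset, mean_gap + stdev_gap
    return best, retries


# Hypothetical data set: 200 samples with 2 features each.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(200)]
subset, attempts = sample_subset(
    rows, size=50, mean_limit=0.5, stdev_limit=0.5, retries=25
)
print(len(subset), attempts)
```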

Split by Metadata Value

A validation scheme wherein the data set is divided into training and test sets based on the metadata value. In other words, for a data set consisting of 100 samples with the metadata column set to ‘train’ for 60 samples, and ‘test’ for 40 samples, the training set will consist of 60 samples for which the metadata value is ‘train’ and the test set will consist of 40 samples for which the metadata value is ‘test’.

Parameters

metadata_name (str) – name of the metadata column to use for splitting
training_values (list[str]) – list of values of the named column to select samples for training
validation_values (list[str]) – list of values of the named column to select samples for validation
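
The 60/40 example above amounts to a simple filter on the metadata column. The sketch below assumes the column is available as a plain list of values; the function name `split_by_metadata_value` is hypothetical.

```python
def split_by_metadata_value(metadata, training_values, validation_values):
    """Return (train_indices, validation_indices) selected by metadata value."""
    train = [i for i, v in enumerate(metadata) if v in training_values]
    validation = [i for i, v in enumerate(metadata) if v in validation_values]
    return train, validation


# The example from the text: 60 samples marked 'train', 40 marked 'test'.
metadata = ["train"] * 60 + ["test"] * 40
train, validation = split_by_metadata_value(
    metadata, training_values=["train"], validation_values=["test"]
)
print(len(train), len(validation))  # 60 40
```

Samples whose metadata value appears in neither list are simply left out of both sets in this sketch.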

Stratified Shuffle Split

A validation scheme which splits the data set into training, validation, and (optionally) test sets based on the parameters provided, with similar distribution of labels (hence stratified).

In other words, for a data set consisting of 100 samples in total with 40 samples from class 1 and 60 samples from class 2, for stratified shuffle split with validation_size = 0.4, the validation set will consist of 40 samples with 16 samples from class 1 and 24 samples from class 2, and the training set will consist of 60 samples with 24 samples from class 1 and 36 samples from class 2.

For each fold, the data is re-shuffled and split into training and validation sets again.

Parameters

test_size (float) – target fraction of the total data set to use for testing
validation_size (float) – target fraction of the total data set to use for validation (for example, 0.4)
number_of_folds (int) – the number of stratified folds (iterations) to produce
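
The worked example with validation_size = 0.4 can be checked with the following sketch. It is an illustration under stated assumptions rather than the library's code: the function name `stratified_shuffle_split` and the `seed` argument are hypothetical, and the optional `test_size` hold-out is omitted.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, validation_size, number_of_folds, seed=0):
    """Return one (train_indices, validation_indices) pair per fold."""
    # Group sample indices by class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    rng = random.Random(seed)  # hypothetical seed, for repeatable shuffles
    splits = []
    for _ in range(number_of_folds):
        train, validation = [], []
        for class_indices in indices_by_class.values():
            shuffled = class_indices[:]
            rng.shuffle(shuffled)  # fresh shuffle on every fold
            # Take the same fraction of every class for validation.
            cut = round(len(shuffled) * validation_size)
            validation.extend(shuffled[:cut])
            train.extend(shuffled[cut:])
        splits.append((train, validation))
    return splits


def class_counts(indices, labels):
    """Count how many of the given indices belong to each class."""
    return {cls: sum(labels[i] == cls for i in indices) for cls in (1, 2)}


# The example from the text: 40 samples of class 1 and 60 of class 2.
labels = [1] * 40 + [2] * 60
splits = stratified_shuffle_split(labels, validation_size=0.4, number_of_folds=3)
train, validation = splits[0]
print(class_counts(validation, labels))  # {1: 16, 2: 24}
print(class_counts(train, labels))       # {1: 24, 2: 36}
```

Unlike k-fold, the validation sets of different folds may overlap here, since each fold is an independent random split.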