Validation#
The validation module allows construction of datasets and dataset schemas. Datasets allow easy conversion between data formats, whereas schemas allow a dataset to be checked to ensure it conforms to a prespecified format.
DataEncoding#
- class predict_backend.validation.dataset.DataEncoding(value)#
Bases: ExtendedEnum
An enumeration.
- ORDINAL = 'ordinal'#
- ONE_HOT = 'one_hot'#
- VERBOSE = 'verbose'#
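Since the encoding parameter of Dataset below is typed Union[str, DataEncoding], an encoding can be given either as an enum member or as its string value. A quick sketch, assuming standard Enum semantics from the ExtendedEnum base:
>>> from predict_backend.validation.dataset import DataEncoding
>>> DataEncoding.ORDINAL.value
'ordinal'
>>> DataEncoding("one_hot") is DataEncoding.ONE_HOT
True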
Dataset#
- class predict_backend.validation.dataset.Dataset(dataset=None, label=None, metadata=None, name=None, encoding='ordinal', one_hot_dict=None, cat_to_vals=None, categorical_cols=None, predict_cols=None, description=None, version=0)#
The dataset asset allows for easy conversion between data formats when provided with the additional inputs needed for conversion.
- Parameters:
  - dataset – A dataset containing numerical and categorical columns. Should be given as a pandas DataFrame.
  - label (Optional[str]) – Label for Asset. See Asset documentation for more details.
  - metadata (Optional[dict]) – Asset metadata. See Asset documentation for more details.
  - name (Optional[str]) – Name for Asset. See Asset documentation for more details.
  - encoding (Union[str, DataEncoding]) – DataEncoding enum specifying the encoding of the given dataset. Possible values are ‘ordinal’, ‘one_hot’, or ‘verbose’. These types refer to the format of categorical features. ‘Ordinal’ means that each categorical feature is contained in a single column and encoded with integers. ‘Verbose’ is the same format, but encoded with strings instead of integers. ‘One_hot’ means each categorical feature is split into one column per possible value, with exactly one ‘1’ across these feature columns in each row.
  - one_hot_dict (Optional[Dict[str, List[str]]]) – Allows conversion to and from ‘one_hot’ encoding. This is a dictionary mapping names of categorical features to the list of columns in the dataset which correspond to the given feature. If not provided and the dataset is given in a one hot encoding, attempts to create one_hot_dict assuming that the columns were created using pd.get_dummies.
  - cat_to_vals (Optional[Dict[str, List[Union[int, float, complex, number, str, object]]]]) – A dictionary mapping names of categorical features to a list of their possible values. Even if not provided, one is inferred from the given dataset.
  - categorical_cols (Optional[List[str]]) – A list of the feature names of the categorical features. For an ordinal or verbose encoded dataset, this is just the name of the column of the categorical feature. For a one_hot encoded dataset, it is the corresponding name of the feature.
  - predict_cols (Optional[List[str]]) – Names of columns in the dataset that will be used by a model. This allows filtering of the dataframe when passing to a model even if the dataset contains extraneous columns. These columns are expected to match the provided encoding of the dataset.
  - description (Optional[str]) – Description of Asset, see its documentation for more details.
  - version (int) – Version of Asset, see its documentation for more details.
Let’s go through an example of a Dataset asset. Say that we have some data; we’ll write the same data in several different formats for use throughout the example:
>>> import pandas as pd
>>> from predict_backend.validation.dataset import Dataset, DataEncoding
>>> df_verbose = pd.DataFrame({"age(years)": [15.4, 23.6, 80.8], "pet": ["cat", "dog", "dog"], "breed": ["siamese", "beagle", "labrador"]})
>>> df_ordinal = pd.DataFrame({"age(years)": [15.4, 23.6, 80.8], "pet": [0, 1, 1], "breed": [0, 1, 2]})
>>> df_onehot = pd.get_dummies(df_verbose)
For the first example, we can create a Dataset object around the verbose dataframe:
>>> dataset_verbose = Dataset(dataset=df_verbose, encoding=DataEncoding.VERBOSE, label="an example flow", name="verbose dataset")
This creates a dataset with a verbose encoding. Providing a dataset with this encoding automatically allows conversion to the ordinal encoding:
>>> df_converted_ordinal = dataset_verbose.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_ordinal.equals(df_ordinal)
True
For our second example, we provide the Dataset object with an ordinal dataframe. With no additional parameters, the corresponding verbose encoding would just be the same as the ordinal encoding:
>>> dataset_ordinal = Dataset(df_ordinal, encoding=DataEncoding.ORDINAL, label="an example flow", name="ordinal dataset")
>>> df_converted_verbose = dataset_ordinal.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_verbose.equals(df_ordinal)
True
If we instead wanted to convert to a verbose encoding, we would need to provide an additional conversion parameter called cat_to_vals:
>>> cat_to_vals = {"pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
>>> dataset_ordinal = Dataset(df_ordinal, encoding=DataEncoding.ORDINAL, cat_to_vals=cat_to_vals, label="an example flow", name="ordinal dataset")
>>> df_converted_verbose = dataset_ordinal.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_verbose.equals(df_verbose)
True
The parameter is a dictionary mapping category names to their possible labels. The labels should be in the same order as the ordinal encoding, so 0, 1, and 2 map to “siamese”, “beagle”, and “labrador” respectively, because that is the order in which they appear in the list.
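To make the ordering explicit, here is a plain-pandas sketch of the correspondence implied by cat_to_vals (illustrative only; the actual conversion is performed by get_as_encoding):
>>> {code: label for code, label in enumerate(cat_to_vals["breed"])}
{0: 'siamese', 1: 'beagle', 2: 'labrador'}
>>> df_ordinal["breed"].map({0: "siamese", 1: "beagle", 2: "labrador"}).tolist()
['siamese', 'beagle', 'labrador']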
For our last example, we will create a dataset from a one hot encoded dataframe. In order to convert to any other encoding type (which is especially necessary for use with an explainer), we need to provide additional parameters:
>>> one_hot_dict = {"pet": ["pet_cat", "pet_dog"], "breed": ["breed_siamese", "breed_beagle", "breed_labrador"]}
>>> cat_to_vals = {"pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
>>> dataset_onehot = Dataset(df_onehot, encoding=DataEncoding.ONE_HOT, one_hot_dict=one_hot_dict, cat_to_vals=cat_to_vals, label="an example flow", name="onehot dataset")
>>> df_converted_verbose = dataset_onehot.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_ordinal = dataset_onehot.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_verbose.equals(df_verbose)
True
>>> df_converted_ordinal.equals(df_ordinal)
True
The one_hot_dict parameter allows conversion to and from the other format types. It is a map from a category to a list of the corresponding one hot columns in the dataframe. As with cat_to_vals, the ordering of the list determines the ordering of the corresponding ordinal encoding, and the orderings of the lists in one_hot_dict and cat_to_vals are expected to correspond. The cat_to_vals parameter is the same as in the ordinal case, but if it is not provided the Dataset object will infer one by simply setting cat_to_vals to be the same dictionary as one_hot_dict.
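For intuition, here is a plain-pandas sketch of how the one hot columns for a category collapse back to a verbose column using the two dictionaries above (illustrative only; Dataset handles this internally via get_as_encoding):
>>> pet_cols = one_hot_dict["pet"]
>>> df_onehot[pet_cols].idxmax(axis=1).map(dict(zip(pet_cols, cat_to_vals["pet"]))).tolist()
['cat', 'dog', 'dog']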
These parameters are clunky, and they are usually unnecessary when the one hot encoding was performed in a standard way. If the dataframe’s one hot encoding was created using pd.get_dummies, as in this example, we only need to provide the names of the categories:
>>> categorical_cols = ["pet", "breed"]
>>> dataset_onehot = Dataset(df_onehot, encoding=DataEncoding.ONE_HOT, categorical_cols=categorical_cols, label="an example flow", name="onehot dataset")
>>> df_converted_verbose = dataset_onehot.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_ordinal = dataset_onehot.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_verbose.equals(df_verbose)
True
>>> df_converted_ordinal.equals(df_ordinal)
True
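One constructor parameter not exercised above is predict_cols, which marks the columns a model should consume so that extraneous columns can remain in the underlying dataframe. A minimal sketch, where the patient_id column is made up for illustration:
>>> df_extra = df_verbose.assign(patient_id=[101, 102, 103])
>>> dataset_filtered = Dataset(dataset=df_extra, encoding=DataEncoding.VERBOSE, predict_cols=["age(years)", "pet", "breed"], label="an example flow", name="filtered dataset")
How the filtered view is handed to a model is not shown on this page; see the predict_cols description above.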
Schema#
- class predict_backend.validation.schema.Schema(schema=None, exact_match=False, valid_inputs=None, label=None, metadata=None, name=None, description=None, version=0)#
A schema asset allows validation of a DataFrame according to a pre-specified schema. A schema is specified using a pd.Series object whose index entries are the expected column names and whose values are their expected dtypes. There are two levels of validation that can be performed. The first, “level 1” validation ensures that the dataframe contains all the expected columns specified in the schema, and optionally ensures that no additional columns are present. It also ensures that each column’s dtype matches the schema. The optional “level 2” validation performs additional checks, ensuring that numerical columns fall within a specified range of values and that categorical columns take on a value from a specified list. If a check fails, an exception is raised; otherwise the validation function returns True.
- Parameters:
  - schema (Optional[Series]) – A schema object. Indices should be expected column names and values should be expected dtypes.
  - exact_match (bool) – If true, the dataframe must not have any additional columns not expected in the schema, or else an exception will be raised. If performing level 2 validation, the valid_inputs variable must have key/value entries for each column in the schema. If false, these checks will be ignored.
  - valid_inputs (Optional[Dict[str, List]]) – A dictionary mapping column names (str) to lists describing their valid inputs. For columns with a numerical dtype, the value is expected to be [min, max], where min and max are the minimum and maximum possible values in the column respectively. For columns with an “object” dtype (i.e. string/categorical columns), the value is expected to be a list of all possible values in the column. If valid_inputs is None, this level 2 validation will not be performed.
  - label (Optional[str]) – Label for Asset, see its documentation for more details.
  - metadata (Optional[dict]) – Metadata for Asset, see its documentation for more details.
  - name (Optional[str]) – Name for Asset, see its documentation for more details.
  - description (Optional[str]) – Description of Asset, see its documentation for more details.
  - version (int) – Version for Asset, see its documentation for more details.
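To tie the pieces together, here is a minimal sketch of a schema for the verbose dataframe from the Dataset examples above. The construction uses only the documented constructor parameters (dtype values are given as strings, which pandas accepts as dtype specifications, and the age range in valid_inputs is an arbitrary choice for illustration). The name of the validation method is not documented on this page, so the final call assumes a validate method and should be checked against the class reference:
>>> import pandas as pd
>>> from predict_backend.validation.schema import Schema
>>> expected = pd.Series({"age(years)": "float64", "pet": "object", "breed": "object"})
>>> valid_inputs = {"age(years)": [0.0, 120.0], "pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
>>> schema = Schema(schema=expected, exact_match=True, valid_inputs=valid_inputs, name="pets schema")
>>> schema.validate(df_verbose)  # method name assumed, not documented here
True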