dataset#

class predict_backend.validation.dataset.DataEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: ExtendedEnum

ONE_HOT = 'one_hot'#
ORDINAL = 'ordinal'#
VERBOSE = 'verbose'#
class predict_backend.validation.dataset.Dataset(dataset, label, metadata=None, name=None, encoding='ordinal', one_hot_dict=None, cat_to_vals=None, categorical_cols=None, predict_cols=None, description=None, version=None)#

Bases: Asset

The dataset asset allows for easy conversion between data encodings when it is provided with the additional inputs needed to describe those encodings.

Parameters:
  • dataset (DataFrame) – A dataset containing numerical and categorical columns. Should be given as a pandas DataFrame.

  • label (str) – Label for Asset. See Asset documentation for more details.

  • metadata (Optional[dict]) – Asset metadata. See asset documentation for more details.

  • name (Optional[str]) – Name for Asset. See Asset documentation for more details.

  • encoding (Union[str, DataEncoding]) – DataEncoding enum (or equivalent string) specifying the format of the given dataset’s categorical features. Possible values are ‘ordinal’, ‘one_hot’, or ‘verbose’. ‘Ordinal’ means each categorical feature occupies a single column whose categories are encoded as integers. ‘Verbose’ is the same single-column format, but encoded with strings instead of integers. ‘One_hot’ means each categorical feature is split into one column per possible value, with exactly one ‘1’ per row across those columns.

  • one_hot_dict (Optional[Dict[str, List[str]]]) – Allows conversion to and from ‘one_hot’ encoding. This is a dictionary mapping from names of categorical features to a list of strings of the columns in the dataset which correspond to the given feature. If not provided and the dataset is given in a one hot encoding, attempts to create one_hot_dict assuming that the columns were created using pd.get_dummies.

  • cat_to_vals (Optional[Dict[str, List[Union[int, float, complex, number, str, object]]]]) – A dictionary mapping from names of categorical features to a list of their possible values. Even if not provided, one is inferred from the given dataset.

  • categorical_cols (Optional[List[str]]) – A list of strings representing the names of the categorical features. For an ordinal or verbose encoded dataset, this is just the name of each categorical feature’s column. For a one_hot encoded dataset, it is the corresponding name of the feature.

  • predict_cols (Optional[List[str]]) – Names of columns in the dataset that will be used by a model. This allows filtering of the dataframe when passing to a model even if the dataset contains extraneous columns. These columns are expected to match the provided encoding of the dataset.

  • description (Optional[str]) – Description of Asset, see its documentation for more details.

  • version (Optional[int]) – Version of Asset, see its documentation for more details.

check_valid_encoding(encoding=None)#

Converts the provided encoding to a DataEncoding enum. If no encoding is provided, defaults to the encoding of the dataframe given at initialization.

Parameters:

encoding (Union[str, DataEncoding, None]) – The encoding to be validated. If None, the function returns the default encoding provided at initialization. String versions of the encodings are also accepted; valid strings are ‘ordinal’, ‘one_hot’, and ‘verbose’. Defaults to None.

Raises:

ValueError – If the provided string does not match a valid DataEncoding.

Return type:

DataEncoding

Returns:

The corresponding DataEncoding enum.
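The validation above can be sketched with a standalone stand-in for the DataEncoding enum. This is a hypothetical illustration of the behavior described, not the library’s implementation; the `default` parameter here stands in for the encoding stored at initialization:

```python
from enum import Enum

# Standalone stand-in for DataEncoding (hypothetical; mirrors the enum above).
class DataEncoding(Enum):
    ONE_HOT = "one_hot"
    ORDINAL = "ordinal"
    VERBOSE = "verbose"

def check_valid_encoding(encoding, default=DataEncoding.ORDINAL):
    """Validate a string or enum encoding, falling back to a default."""
    if encoding is None:
        return default                 # no encoding given: use the default
    if isinstance(encoding, DataEncoding):
        return encoding                # already an enum: pass through
    try:
        return DataEncoding(encoding)  # e.g. "one_hot" -> DataEncoding.ONE_HOT
    except ValueError:
        raise ValueError(f"{encoding!r} is not a valid DataEncoding")

print(check_valid_encoding("one_hot"))  # DataEncoding.ONE_HOT
print(check_valid_encoding(None))       # DataEncoding.ORDINAL
```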

convert_dtypes(X)#

Converts the dtypes of the columns in X to match the dtypes of this asset’s dataset object.

Parameters:

X (DataFrame) – The dataframe whose dtypes will be converted.

Returns:

The same dataframe with converted dtypes.
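A minimal sketch of this kind of dtype alignment, assuming it can be expressed with pandas `astype` against the reference dataset’s dtypes (the variable names are hypothetical):

```python
import pandas as pd

# Reference dataframe standing in for the asset's stored dataset.
reference = pd.DataFrame({"age(years)": [15.4, 23.6], "pet": ["cat", "dog"]})

# Incoming dataframe where the numeric column was parsed as strings.
X = pd.DataFrame({"age(years)": ["80.8"], "pet": ["dog"]})

# Align X's dtypes with the reference dataset's dtypes.
X_converted = X.astype(reference.dtypes.to_dict())
print(X_converted["age(years)"].dtype)  # float64
```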

convert_encoding(X, from_=None, to_=None, filter=False)#

Converts the dataframe from and to the specified encodings. The dataframe should be in the encoding specified in from_. The dataframe can also be filtered at the same time to contain only prediction columns.

Parameters:
  • X (DataFrame) – The dataframe to be converted. Should be in the encoding specified in from_.

  • from_ (Union[str, DataEncoding, None]) – The encoding of the provided dataframe. If None, it assumes the dataframe is in the same encoding as the original provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

  • to_ (Union[str, DataEncoding, None]) – The encoding to convert the dataframe to. If None, it converts to the encoding of the original provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

  • filter (Optional[bool]) – Whether to filter the provided data to only contain prediction columns. Defaults to False.

Raises:

ValueError – When either the from_ or to_ encodings are not supported for conversion.

Return type:

DataFrame

Returns:

The converted dataframe.

filter_data(X, encoding=None)#

Filters columns of the provided dataframe so that they contain only columns used for model prediction. This filtering is only possible when this Dataset object was initialized with the predict_cols parameter.

Parameters:
  • X (DataFrame) – The dataframe which will be filtered. This dataframe should contain every column specified in the predict_cols parameter at initialization.

  • encoding (Union[str, DataEncoding, None]) – The encoding of the provided dataframe. If None, it assumes the dataframe is in the same encoding as the original provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

Returns:

The filtered dataset.
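Conceptually, this filtering amounts to a column subset. A minimal sketch under that assumption (the `owner_id` column is hypothetical, standing in for an extraneous column):

```python
import pandas as pd

# Columns the model actually uses, as would be passed via predict_cols.
predict_cols = ["age(years)", "pet"]

# A dataframe carrying an extraneous column alongside the prediction columns.
X = pd.DataFrame({"age(years)": [15.4], "pet": ["cat"], "owner_id": [7]})

# Keep only the prediction columns, dropping 'owner_id'.
X_filtered = X[predict_cols]
print(list(X_filtered.columns))  # ['age(years)', 'pet']
```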

classmethod from_csv(csv, **kwargs)#

Constructs a dataset object from a CSV file provided in bytes format. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:

csv (BytesIO) – The csv file to turn into a dataset.

Return type:

Dataset

Returns:

The Dataset object containing the csv data.
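The parsing step here can be sketched with pandas, which reads CSV bytes through a BytesIO wrapper. In the real API the resulting dataframe would then be handed to the Dataset constructor; this sketch shows only the bytes-to-dataframe step:

```python
from io import BytesIO

import pandas as pd

# CSV content as raw bytes, as it might arrive from an upload.
raw = b"age(years),pet\n15.4,cat\n23.6,dog\n"

# Parse the bytes into a dataframe via an in-memory buffer.
df = pd.read_csv(BytesIO(raw))
print(df.shape)  # (2, 2)
```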

classmethod from_excel(excel, sheet_name=0, **kwargs)#

Constructs a dataset object from an Excel file provided in bytes format. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:
  • excel (BytesIO) – The excel file to turn into a dataset.

  • sheet_name (Union[int, str]) – The sheet to read into the dataset. Integers are interpreted as the index of the sheet, while strings are interpreted as sheet names. Defaults to 0.

Return type:

Dataset

Returns:

The Dataset object containing the excel data.

classmethod from_json(json, **kwargs)#

Constructs a dataset object from a JSON file provided in bytes format. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:

json (BytesIO) – The json file to turn into a dataset.

Return type:

Dataset

Returns:

The Dataset object containing the json data.

get_as_encoding(encoding=None, filter=False)#

Returns this asset’s dataset as the specified encoding.

Parameters:
  • encoding (Union[str, DataEncoding, None]) – The encoding to convert the dataframe to. If None, it assumes the dataframe is in the same encoding as the original provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

  • filter (bool) – Whether to filter the provided data to only contain prediction columns. Defaults to False.

Return type:

DataFrame

Returns:

The dataset in the specified encoding, with additional filtering if specified.

get_categorical_names(predict_cols=False)#

Returns the names of the categorical columns of this dataset. Can also optionally return only categorical columns which are also prediction columns.

Parameters:

predict_cols (bool) – Whether to reduce the set of categorical columns returned to just prediction columns. Defaults to False.

Return type:

List[str]

Returns:

The list of categorical columns.

Let’s go through an example of a Dataset asset. We’ll write the same data in several different formats for use throughout the example:

>>> import pandas as pd
>>> df_verbose = pd.DataFrame({"age(years)": [15.4, 23.6, 80.8], "pet": ["cat", "dog", "dog"], "breed": ["siamese", "beagle", "labrador"]})
>>> df_ordinal = pd.DataFrame({"age(years)": [15.4, 23.6, 80.8], "pet": [0, 1, 1], "breed": [0, 1, 2]})
>>> df_onehot = pd.get_dummies(df_verbose)

For the first example, we can create a Dataset object around the verbose dataset:

>>> dataset_verbose = Dataset(dataset=df_verbose, encoding=DataEncoding.VERBOSE, label="an example flow", name="verbose dataset")

This creates a dataset with a verbose encoding. Providing a dataset in this encoding automatically allows conversion to the ordinal encoding:

>>> df_converted_ordinal = dataset_verbose.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_ordinal.equals(df_ordinal)
True

For our second example, we provide the Dataset object with an ordinal dataframe. With no additional parameters, the corresponding verbose encoding would just be the same as the ordinal encoding:

>>> dataset_ordinal = Dataset(df_ordinal, encoding=DataEncoding.ORDINAL, label="an example flow", name="ordinal dataset")
>>> df_converted_verbose = dataset_ordinal.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_verbose.equals(df_ordinal)
True

If we instead wanted to convert to a verbose encoding, we would need to provide an additional conversion parameter called cat_to_vals:

>>> cat_to_vals = {"pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
>>> dataset_ordinal = Dataset(df_ordinal, encoding=DataEncoding.ORDINAL, cat_to_vals=cat_to_vals, label="an example flow", name="ordinal dataset")
>>> df_converted_verbose = dataset_ordinal.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_verbose.equals(df_verbose)
True

The parameter is a dictionary mapping category names to their possible labels. The labels should be in the same order as the ordinal encoding, so 0, 1, and 2 map to “siamese”, “beagle”, and “labrador” respectively, because that is the order in which they appear in the list.
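The mapping described above can be sketched directly: each integer code indexes into the corresponding category’s value list. This is an illustration of the cat_to_vals convention, not the library’s internal conversion code:

```python
import pandas as pd

# Each category's values, in the order matching the ordinal codes.
cat_to_vals = {"pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
df_ordinal = pd.DataFrame({"pet": [0, 1, 1], "breed": [0, 1, 2]})

# Replace each integer code with the value at that position in the list.
df_verbose = df_ordinal.copy()
for col, vals in cat_to_vals.items():
    df_verbose[col] = df_ordinal[col].map(dict(enumerate(vals)))

print(df_verbose["breed"].tolist())  # ['siamese', 'beagle', 'labrador']
```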

For our last example, we will create a dataset from a one hot encoded object. In order to convert to any other encoding type (which is especially necessary for use with an explainer), we need to provide additional parameters:

>>> one_hot_dict = {"pet": ["pet_cat", "pet_dog"], "breed": ["breed_siamese", "breed_beagle", "breed_labrador"]}
>>> cat_to_vals = {"pet": ["cat", "dog"], "breed": ["siamese", "beagle", "labrador"]}
>>> dataset_onehot = Dataset(df_onehot, encoding=DataEncoding.ONE_HOT, one_hot_dict=one_hot_dict, cat_to_vals=cat_to_vals, label="an example flow", name="onehot dataset")
>>> df_converted_verbose = dataset_onehot.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_ordinal = dataset_onehot.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_verbose.equals(df_verbose)
True
>>> df_converted_ordinal.equals(df_ordinal)
True

The one_hot_dict parameter allows conversion to and from the other format types. It is a map from a category name to a list of the corresponding one hot columns in the dataframe. As with cat_to_vals, the ordering of each list determines the ordering of the corresponding ordinal encoding, and the orderings of the lists in one_hot_dict and cat_to_vals are expected to correspond. The cat_to_vals parameter is the same as in the ordinal case, but if it is not provided the Dataset object infers one by setting cat_to_vals to the same dictionary as one_hot_dict.
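The one-hot-to-verbose direction implied by these two dictionaries can be sketched as follows: for each category, the position of the 1 among that category’s columns selects the value at the same position in cat_to_vals. This is a hypothetical illustration of the convention, not the library’s conversion routine:

```python
import pandas as pd

# One category with its one hot columns and its value list, positions aligned.
one_hot_dict = {"pet": ["pet_cat", "pet_dog"]}
cat_to_vals = {"pet": ["cat", "dog"]}
df_onehot = pd.DataFrame({"pet_cat": [1, 0, 0], "pet_dog": [0, 1, 1]})

decoded = {}
for cat, cols in one_hot_dict.items():
    # Index of the 1 in each row, mapped to the same index in cat_to_vals.
    codes = df_onehot[cols].to_numpy().argmax(axis=1)
    decoded[cat] = [cat_to_vals[cat][i] for i in codes]

print(decoded["pet"])  # ['cat', 'dog', 'dog']
```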

These parameters are clunky, and most of the time they are unnecessary if the one hot encoding was performed in a standard way. If the dataframe’s one hot encoding was created using pd.get_dummies, as in this example, we just need to provide the names of the categorical features:

>>> categorical_cols = ["pet", "breed"]
>>> dataset_onehot = Dataset(df_onehot, encoding=DataEncoding.ONE_HOT, categorical_cols=categorical_cols, label="an example flow", name="onehot dataset")
>>> df_converted_verbose = dataset_onehot.get_as_encoding(DataEncoding.VERBOSE)
>>> df_converted_ordinal = dataset_onehot.get_as_encoding(DataEncoding.ORDINAL)
>>> df_converted_verbose.equals(df_verbose)
True
>>> df_converted_ordinal.equals(df_ordinal)
True