dataset

class virtualitics_sdk.assets.dataset.DataEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: ExtendedEnum

ONE_HOT = 'one_hot'
ORDINAL = 'ordinal'
VERBOSE = 'verbose'

class virtualitics_sdk.assets.dataset.Dataset(dataset, label, metadata=None, name=None, encoding='ordinal', one_hot_dict=None, cat_to_vals=None, categorical_cols=None, predict_cols=None, description=None, version=None)

Bases: Asset

The Dataset asset converts a dataset between data formats (encodings), provided it is given the additional inputs needed for the conversion.

Parameters:
  • dataset (DataFrame) – A dataset containing numerical and categorical columns. Should be given as a pandas DataFrame.

  • label (str) – Label for Asset. See Asset documentation for more details.

  • metadata (Optional[dict]) – Asset metadata. See asset documentation for more details.

  • name (Optional[str]) – Name for Asset. See Asset documentation for more details.

  • encoding (Union[str, DataEncoding]) – DataEncoding enum, or its string value, specifying the encoding of the given dataset. Possible values are ‘ordinal’, ‘one_hot’, or ‘verbose’, which describe the format of categorical features. ‘ordinal’ means that each categorical feature is contained in a single column and encoded with integers. ‘verbose’ is the same format, but encoded with strings instead of integers. ‘one_hot’ means each categorical feature is split into one column per possible value, and each row has exactly one ‘1’ among the columns of that feature. Defaults to ‘ordinal’.

  • one_hot_dict (Optional[Dict[str, List[str]]]) – Allows conversion to and from ‘one_hot’ encoding. This is a dictionary mapping each categorical feature name to the list of dataset column names that correspond to that feature. If not provided and the dataset is one-hot encoded, the Dataset attempts to build one_hot_dict on the assumption that the columns were created with pd.get_dummies.

  • cat_to_vals (Optional[Dict[str, List[Union[int, float, complex, number, str, object]]]]) – A dictionary mapping each categorical feature name to a list of its possible values. If not provided, one is inferred from the given dataset.

  • categorical_cols (Optional[List[str]]) – A list of the feature names of the categorical features. For an ordinal or verbose encoded dataset, each entry is the name of the column holding the categorical feature. For a one_hot encoded dataset, each entry is the name of the feature that the one-hot columns belong to.

  • predict_cols (Optional[List[str]]) – Names of columns in the dataset that will be used by a model. This allows filtering of the dataframe when passing to a model even if the dataset contains extraneous columns. These columns are expected to match the provided encoding of the dataset.

  • description (Optional[str]) – Description of Asset, see its documentation for more details.

  • version (Optional[int]) – Version of Asset, see its documentation for more details.
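To make the three encodings concrete, the sketch below writes one small table in each format and shows the shape that one_hot_dict and cat_to_vals take. It uses plain Python lists and dicts rather than a pandas DataFrame so that it runs without the SDK. The column names (color, size), the helper infer_one_hot_dict, and the choice that ordinal integers index into the cat_to_vals list are illustrative assumptions, not the SDK's implementation. The "feature_value" column naming follows the default output of pd.get_dummies.

```python
# One categorical feature ("color") and one numeric feature ("size"),
# written in each of the three encodings described above.
verbose_rows = [
    {"color": "red", "size": 1.5},
    {"color": "blue", "size": 2.0},
]

# Assumption: ordinal integers index into the feature's value list.
cat_to_vals = {"color": ["blue", "red"]}
ordinal_rows = [
    {"color": 1, "size": 1.5},
    {"color": 0, "size": 2.0},
]

# pd.get_dummies names columns "<feature>_<value>" by default.
one_hot_rows = [
    {"size": 1.5, "color_blue": 0, "color_red": 1},
    {"size": 2.0, "color_blue": 1, "color_red": 0},
]


def infer_one_hot_dict(columns, categorical_cols):
    """Hypothetical helper: group one-hot columns by feature-name prefix."""
    return {
        feature: [c for c in columns if c.startswith(feature + "_")]
        for feature in categorical_cols
    }


one_hot_dict = infer_one_hot_dict(list(one_hot_rows[0]), ["color"])
print(one_hot_dict)
```

Prefix matching of this kind is ambiguous when one feature name is a prefix of another (for example color and color_group), which is one reason to pass one_hot_dict explicitly when the column naming is not under your control.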

check_valid_encoding(encoding=None)

Converts provided encoding to a DataEncoding enum. If no encoding is provided, defaults to the original encoding of the provided dataframe in initialization.

Parameters:

encoding (Union[str, DataEncoding, None]) – The encoding to be validated. If None, the function returns the default encoding provided in initialization. The encoding can also be given as a string; valid strings are ordinal, one_hot, and verbose. Defaults to None.

Raises:

ValueError – If the provided string does not match a valid DataEncoding.

Return type:

DataEncoding

Returns:

The corresponding DataEncoding enum.
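The validation described above can be sketched with a standard-library Enum. EncodingSketch is a stand-in that only copies the member values listed for DataEncoding, and check_valid_encoding_sketch with its default argument is a hypothetical free function, not the SDK method; the sketch relies only on the documented behaviour (None falls back to the default, an unknown string raises ValueError).

```python
from enum import Enum


class EncodingSketch(Enum):
    """Stand-in with the same member values as DataEncoding."""
    ONE_HOT = "one_hot"
    ORDINAL = "ordinal"
    VERBOSE = "verbose"


def check_valid_encoding_sketch(encoding=None, default=EncodingSketch.ORDINAL):
    """Return an enum member for `encoding`, or `default` when it is None."""
    if encoding is None:
        return default
    if isinstance(encoding, EncodingSketch):
        return encoding
    try:
        # Looking an Enum up by value raises ValueError when nothing matches.
        return EncodingSketch(encoding)
    except ValueError:
        valid = [member.value for member in EncodingSketch]
        raise ValueError(f"{encoding!r} is not one of {valid}") from None
```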

convert_dtypes(X)

Converts the dtypes of the columns in X to match the dtypes of this asset’s dataset object.

Parameters:

X (DataFrame) – The dataframe whose dtypes will be converted.

Returns:

The same dataframe with converted dtypes.

convert_encoding(X, from_=None, to_=None, filter=False)

Converts the dataframe from and to the specified encodings. The dataframe should be in the encoding specified in from_. The dataframe can also be concurrently filtered to only contain prediction columns.

Parameters:
  • X (DataFrame) – The dataframe to be converted. Should be in the encoding specified in from_.

  • from_ – The encoding of the provided dataframe. If None, the dataframe is assumed to be in the same encoding as the originally provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

  • to_ – The encoding to convert the dataframe to. If None, the encoding of the originally provided dataframe is used as the target. Can also be provided as a string version of the encoding. Defaults to None.

  • filter (bool) – Whether to filter the provided data to only contain prediction columns. Defaults to False.

Raises:

ValueError – When either the from_ or to_ encodings are not supported for conversion.

Return type:

DataFrame

Returns:

The converted dataframe.
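As an illustration of what a conversion between ‘verbose’ and ‘one_hot’ involves, the sketch below round-trips a small table using cat_to_vals and one_hot_dict. It works on lists of dicts instead of DataFrames, and the functions verbose_to_one_hot and one_hot_to_verbose are hypothetical names written for this example; the SDK's own conversion logic may differ.

```python
def verbose_to_one_hot(rows, cat_to_vals):
    """Expand each categorical column into one 0/1 column per value."""
    converted = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if col in cat_to_vals:
                for candidate in cat_to_vals[col]:
                    out[f"{col}_{candidate}"] = int(value == candidate)
            else:
                out[col] = value
        converted.append(out)
    return converted


def one_hot_to_verbose(rows, one_hot_dict):
    """Collapse each group of 0/1 columns back into a single string column."""
    grouped = {c for cols in one_hot_dict.values() for c in cols}
    converted = []
    for row in rows:
        out = {col: v for col, v in row.items() if col not in grouped}
        for feature, cols in one_hot_dict.items():
            hot = next(c for c in cols if row[c] == 1)
            # Strip the "<feature>_" prefix to recover the category value.
            out[feature] = hot[len(feature) + 1:]
        converted.append(out)
    return converted


rows = [{"color": "red", "size": 1.5}, {"color": "blue", "size": 2.0}]
cat_to_vals = {"color": ["blue", "red"]}
one_hot_dict = {"color": ["color_blue", "color_red"]}

encoded = verbose_to_one_hot(rows, cat_to_vals)
decoded = one_hot_to_verbose(encoded, one_hot_dict)
```

The round trip only recovers the original rows when every value in the data appears in cat_to_vals; a value missing from that list would produce a row with no ‘1’ in the group.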

filter_data(X, encoding=None)

Filters columns of the provided dataframe so that they contain only columns used for model prediction. This filtering is only possible when this Dataset object was initialized with the predict_cols parameter.

Parameters:
  • X (DataFrame) – The dataframe to be filtered. This dataframe should contain every column specified in the predict_cols parameter at initialization.

  • encoding (Union[str, DataEncoding, None]) – The encoding of the provided dataframe. If None, it assumes the dataframe is in the same encoding as the original provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

Returns:

The filtered dataset.
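The filtering step amounts to a column selection. A minimal sketch on a list of dicts follows; filter_rows is a hypothetical helper, and raising ValueError when predict_cols is missing is an assumption made for the example, since the error behaviour is not documented here.

```python
def filter_rows(rows, predict_cols):
    """Keep only the prediction columns, in the order given by predict_cols."""
    if predict_cols is None:
        # Mirrors the documented precondition; the error type is an assumption.
        raise ValueError("predict_cols was not provided")
    # A row lacking one of the prediction columns raises KeyError here.
    return [{col: row[col] for col in predict_cols} for row in rows]


rows = [
    {"id": 101, "color": "red", "size": 1.5},
    {"id": 102, "color": "blue", "size": 2.0},
]
filtered = filter_rows(rows, ["color", "size"])
```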

classmethod from_csv(csv, **kwargs)

Constructs a Dataset object from a CSV file held in a bytes buffer. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:

csv (BytesIO) – The csv file to turn into a dataset.

Return type:

Dataset

Returns:

The Dataset object containing the csv data.
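The csv argument is an in-memory bytes buffer rather than a file path. The sketch below builds such a buffer with the standard library and reads it back to show its contents. The sample data is made up, and the commented-out from_csv call, including the label value, is a hypothetical usage based only on the signatures on this page.

```python
import csv
import io

csv_text = "color,size\nred,1.5\nblue,2.0\n"
buffer = io.BytesIO(csv_text.encode("utf-8"))

# getvalue() returns the full contents without moving the read position,
# so the buffer can still be handed to another reader afterwards.
text = buffer.getvalue().decode("utf-8")
rows = list(csv.DictReader(io.StringIO(text)))
print(rows)

# Hypothetical usage with the SDK installed (label is a required
# constructor argument and is forwarded through **kwargs):
# dataset = Dataset.from_csv(buffer, label="example-dataset")
```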

classmethod from_excel(excel, sheet_name=0, **kwargs)

Constructs a Dataset object from an Excel file held in a bytes buffer. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:
  • excel (BytesIO) – The excel file to turn into a dataset.

  • sheet_name (Union[int, str]) – The sheet to read into the dataset. Integers are interpreted as the index of the sheet, while strings are interpreted as sheet names. Defaults to 0.

Return type:

Dataset

Returns:

The Dataset object containing the excel data.

classmethod from_json(json, **kwargs)

Constructs a Dataset object from a JSON file held in a bytes buffer. Any additional keyword arguments are passed directly to the Dataset constructor.

Parameters:

json (BytesIO) – The json file to turn into a dataset.

Return type:

Dataset

Returns:

The Dataset object containing the json data.

get_as_encoding(encoding=None, filter=False)

Returns this asset’s dataset as the specified encoding.

Parameters:
  • encoding (Union[str, DataEncoding, None]) – The encoding to convert the dataframe to. If None, the dataset is returned in the encoding of the originally provided dataframe. Can also be provided as a string version of the encoding. Defaults to None.

  • filter (bool) – Whether to filter the returned data to only contain prediction columns. Defaults to False.

Return type:

DataFrame

Returns:

The dataset in the specified encoding, with additional filtering if specified.

get_categorical_names(predict_cols=False)

Returns the names of the categorical columns of this dataset. Can also optionally return only categorical columns which are also prediction columns.

Parameters:

predict_cols (bool) – Whether to reduce the set of categorical columns returned to just prediction columns. Defaults to False.

Return type:

List[str]

Returns:

The list of categorical columns.

initialize_encodings(encoding, one_hot_dict=None, cat_to_vals=None, categorical_cols=None)

Return type:

None

initialize_filters(predict_cols)