streamline.utils.dataset module

class streamline.utils.dataset.Dataset(dataset_path, class_label, match_label=None, instance_label=None)[source]

Bases: object

Creates a dataset from the path to a tabular file.

Parameters:
  • dataset_path – path of tabular file (as csv, tsv, or txt)

  • class_label – column label for the outcome to be predicted in the dataset

  • match_label – column identifying unique groups of instances that have been ‘matched’ while preparing the dataset, for example cases and controls matched on some covariates. The match label is only used in cross-validation partitioning, where it keeps any set of instances with the same match label value in the same partition.

  • instance_label – mostly used by the rule-based learner in modeling, where it traces heterogeneous subgroups back to the instances in the original dataset
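The effect of match_label on partitioning can be illustrated with a minimal sketch in plain Python. This is not STREAMLINE's implementation: the helper name partition_by_match, the column name MatchID, and the round-robin assignment of groups are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_match(rows, match_label, n_partitions):
    """Assign row indices to partitions so matched rows stay together.

    Hypothetical helper for illustration; not part of streamline.
    """
    # Group row indices by their match label value.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[match_label]].append(index)

    # Deal whole groups out to partitions in round-robin order,
    # so no group is ever split across two partitions.
    partitions = [[] for _ in range(n_partitions)]
    for position, members in enumerate(groups.values()):
        partitions[position % n_partitions].extend(members)
    return partitions


rows = [
    {"MatchID": "a", "Class": 1},
    {"MatchID": "a", "Class": 0},
    {"MatchID": "b", "Class": 1},
    {"MatchID": "b", "Class": 0},
    {"MatchID": "c", "Class": 1},
    {"MatchID": "c", "Class": 0},
]
folds = partition_by_match(rows, "MatchID", n_partitions=2)
```

Each matched case/control pair lands in a single partition, which is the property the match label is meant to guarantee.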

clean_data(ignore_features)[source]

Basic data cleaning: drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user.
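The two cleaning steps described here can be sketched in plain Python, using a list of dictionaries as a stand-in for the dataset. This is an illustration under assumptions, not the method's actual code: the helper name clean_rows is hypothetical, and missing values are modeled as None.

```python
def clean_rows(rows, class_label, ignore_features):
    """Drop rows with a missing outcome, then drop the ignored columns.

    Hypothetical stand-in for illustration; missing values are None.
    """
    ignored = set(ignore_features)
    cleaned = []
    for row in rows:
        if row.get(class_label) is None:
            continue  # instance has no outcome value, so skip it
        # Keep every column except the ones the user asked to ignore.
        cleaned.append({k: v for k, v in row.items() if k not in ignored})
    return cleaned


rows = [
    {"Age": 34, "Notes": "x", "Class": 1},
    {"Age": 51, "Notes": "y", "Class": None},
    {"Age": 47, "Notes": "z", "Class": 0},
]
cleaned = clean_rows(rows, class_label="Class", ignore_features=["Notes"])
```

The second row is removed because its outcome is missing, and the Notes column is removed from the rows that remain.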

feature_only_data()[source]

Creates a features-only version of the dataset for some operations.

Returns: dataframe x_data with only features

get_outcome()[source]

Gets the outcome values from the data.

Returns: outcome column

load_data()[source]

Loads the data into the dataset.

non_feature_data()[source]

Creates a non-features version of the dataset for some operations.

Returns: dataframe y_data with only non-features
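The split between feature and non-feature columns can be sketched for a single row in plain Python. The sketch assumes that the non-feature columns are exactly the class, match, and instance labels passed to the constructor, which the documentation above does not state explicitly. The helper name split_columns and the column names are hypothetical.

```python
def split_columns(row, class_label, match_label=None, instance_label=None):
    """Split one row into feature and non-feature columns.

    Assumption: non-feature columns are whichever label columns were given.
    """
    non_feature_names = {
        name for name in (class_label, match_label, instance_label) if name
    }
    features = {k: v for k, v in row.items() if k not in non_feature_names}
    non_features = {k: v for k, v in row.items() if k in non_feature_names}
    return features, non_features


row = {"InstanceID": 7, "Age": 34, "BMI": 22.5, "Class": 1}
x_row, y_row = split_columns(
    row, class_label="Class", instance_label="InstanceID"
)
```

Here x_row plays the role of x_data and y_row the role of y_data, restricted to one instance.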

set_headers(experiment_path, phase='exploratory')[source]

Exports dataset header labels for use as a reference later in the pipeline.

Returns: list of header labels