streamline.dataprep.data_process module

class streamline.dataprep.data_process.DataProcessing(cv_train_path, cv_test_path, experiment_path, scale_data=True, impute_data=True, multi_impute=True, overwrite_cv=True, class_label='Class', instance_label=None, random_state=None)[source]

Bases: Job

Data Processing Job Class for Scaling and Imputation of CV Datasets

Parameters:

cv_train_path –
cv_test_path –
experiment_path –
scale_data –
impute_data –
multi_impute –
overwrite_cv –
class_label –
instance_label –
random_state –

data_scaling(x_train, x_test)[source]

Conducts data scaling using the scikit-learn StandardScaler, which standardizes features by removing the mean and scaling to unit variance.

This scaling transformation is determined (i.e. fit) based on the training dataset alone; the same scaling is then applied (i.e. transform) to both the training and testing datasets. The fitted scaler is pickled so that it can be applied identically to future data during model application.

Parameters:

x_train – pandas dataframe with train set data
x_test – pandas dataframe with test set data

Returns: Scaled x_train and x_test
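
The fit-on-train, transform-both pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method's actual implementation: the helper name scale_train_test and the toy dataframes are hypothetical, and the pickle round trip only stands in for however the fitted scaler is stored on disk.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_train_test(x_train, x_test):
    """Hypothetical helper: fit the scaler on train only, apply it to both sets."""
    scaler = StandardScaler()
    scaler.fit(x_train)  # mean and variance come from the training data alone
    x_train_scaled = pd.DataFrame(
        scaler.transform(x_train), columns=x_train.columns, index=x_train.index
    )
    x_test_scaled = pd.DataFrame(
        scaler.transform(x_test), columns=x_test.columns, index=x_test.index
    )
    return x_train_scaled, x_test_scaled, scaler


# Toy data (made up for illustration)
x_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
x_test = pd.DataFrame({"a": [2.0, 4.0], "b": [20.0, 40.0]})

x_train_s, x_test_s, scaler = scale_train_test(x_train, x_test)

# Pickling the fitted scaler lets the identical transformation be reapplied later
restored = pickle.loads(pickle.dumps(scaler))
x_test_again = restored.transform(x_test)
```

Because the scaler never sees the testing data during fitting, the scaled test features are generally not centered at zero, which is the expected behavior.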

impute_cv_data(x_train, x_test)[source]

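Imputes missing values in the training and testing datasets. Begins by imputing categorical variables with simple ‘mode’ imputation, then imputes quantitative features (MICE-based iterative imputation, as described under run()).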

Parameters:

x_train – pandas dataframe with train set data
x_test – pandas dataframe with test set data

Returns: Imputed x_train and x_test
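
A rough sketch of the two imputation steps follows. Several things here are assumptions rather than documented behavior: the helper name impute_train_test, the explicit categorical column list, the choice to learn the imputation from the training set only (mirroring data_scaling), and the use of scikit-learn's IterativeImputer as the MICE-style imputer.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer


def impute_train_test(x_train, x_test, categorical, random_state=None):
    """Hypothetical helper: mode-impute categorical columns, then
    iteratively impute the remaining (quantitative) columns."""
    x_train, x_test = x_train.copy(), x_test.copy()

    # Step 1: mode imputation, with the mode taken from the training data
    for col in categorical:
        mode = x_train[col].mode(dropna=True).iloc[0]
        x_train[col] = x_train[col].fillna(mode)
        x_test[col] = x_test[col].fillna(mode)

    # Step 2: iterative imputation of quantitative columns, fit on train only
    quantitative = [c for c in x_train.columns if c not in categorical]
    if quantitative:
        imputer = IterativeImputer(random_state=random_state)
        imputer.fit(x_train[quantitative])
        x_train[quantitative] = imputer.transform(x_train[quantitative])
        x_test[quantitative] = imputer.transform(x_test[quantitative])
    return x_train, x_test


# Toy data (made up for illustration)
x_train = pd.DataFrame({
    "cat": [0, 1, 1, np.nan, 1],
    "q1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "q2": [2.0, 4.0, 6.0, 8.0, 10.0],
})
x_test = pd.DataFrame({
    "cat": [np.nan, 0],
    "q1": [3.0, np.nan],
    "q2": [6.0, 2.0],
})

x_train_i, x_test_i = impute_train_test(x_train, x_test, categorical=["cat"], random_state=42)
```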

load_data()[source]: Load the target training and testing datasets and return respective dataframes, feature header labels, dataset name, and specific cv partition number for this dataset pair.

run()[source]: Run all elements of the data preprocessing: data scaling and missing value imputation (mode imputation for categorical features and MICE-based iterative imputing for quantitative features)

save_runtime()[source]: Save runtime for this phase

write_cv_files(data_train, data_test)[source]

Exports the new training and testing datasets following imputation and/or scaling. Includes an option to overwrite the original datasets (to save space) or to preserve a copy of the original training and testing datasets marked CVOnly (for comparison and quality control).

Parameters:

data_train – pandas dataframe with train set data
data_test – pandas dataframe with test set data

Returns: None
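
The overwrite-or-preserve choice can be illustrated with a short sketch. The helper name write_cv_pair, the file names, and the exact way the CVOnly marker is attached to the preserved copies are all hypothetical; the real naming scheme is not given in this documentation.

```python
import os
import tempfile

import pandas as pd


def write_cv_pair(data_train, data_test, train_path, test_path, overwrite_cv=True):
    """Hypothetical helper: write processed CV files, optionally keeping the originals."""
    if not overwrite_cv:
        # Keep the originals under a CVOnly-marked name (naming is an assumption)
        for path in (train_path, test_path):
            root, ext = os.path.splitext(path)
            os.rename(path, root + "_CVOnly" + ext)
    # Processed data is written at the original paths in both cases
    data_train.to_csv(train_path, index=False)
    data_test.to_csv(test_path, index=False)


# Toy setup in a temporary directory (made up for illustration)
workdir = tempfile.mkdtemp()
train_path = os.path.join(workdir, "demo_Train.csv")
test_path = os.path.join(workdir, "demo_Test.csv")
pd.DataFrame({"a": [1.0, 3.0]}).to_csv(train_path, index=False)
pd.DataFrame({"a": [2.0]}).to_csv(test_path, index=False)

processed_train = pd.DataFrame({"a": [-1.0, 1.0]})
processed_test = pd.DataFrame({"a": [0.0]})
write_cv_pair(processed_train, processed_test, train_path, test_path, overwrite_cv=False)

backup_train = os.path.join(workdir, "demo_Train_CVOnly.csv")
```

With overwrite_cv=True the rename step is skipped, so only the processed files remain on disk.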