streamline.dataprep.data_process module
- class streamline.dataprep.data_process.DataProcessing(cv_train_path, cv_test_path, experiment_path, scale_data=True, impute_data=True, multi_impute=True, overwrite_cv=True, class_label='Class', instance_label=None, random_state=None)[source]
Bases: Job
Data Processing Job Class for Scaling and Imputation of CV Datasets
- Parameters:
cv_train_path –
cv_test_path –
experiment_path –
scale_data –
impute_data –
multi_impute –
overwrite_cv –
class_label –
instance_label –
random_state –
- data_scaling(x_train, x_test)[source]
Conducts data scaling using the scikit-learn StandardScaler method, which standardizes features by removing the mean and scaling to unit variance.
This scaling transformation is determined (i.e. fit) on the training dataset alone; the same scaling is then applied (i.e. transform) to both the training and testing datasets. The fit scaler is pickled so that it can be applied identically to future data during model application.
- Parameters:
x_train – pandas dataframe with train set data
x_test – pandas dataframe with test set data
Returns: Scaled x_train and x_test
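The fit-on-train, transform-both pattern described for data_scaling can be sketched as follows. This is a minimal illustration, not the method's actual source: the helper name scale_train_test and its scaler_path argument are hypothetical, and only StandardScaler, pandas, and pickle are assumed.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_train_test(x_train, x_test, scaler_path=None):
    """Hypothetical helper: fit a StandardScaler on x_train only, transform both."""
    scaler = StandardScaler()
    scaler.fit(x_train)  # mean and variance come from the training data alone

    # Apply the same fitted transformation to both partitions,
    # keeping the original column labels and row index.
    x_train_scaled = pd.DataFrame(
        scaler.transform(x_train), columns=x_train.columns, index=x_train.index
    )
    x_test_scaled = pd.DataFrame(
        scaler.transform(x_test), columns=x_test.columns, index=x_test.index
    )

    # Optionally persist the fitted scaler so it can be reused on future data.
    if scaler_path is not None:
        with open(scaler_path, "wb") as handle:
            pickle.dump(scaler, handle)

    return x_train_scaled, x_test_scaled


# Toy data, for illustration only.
x_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
x_test = pd.DataFrame({"a": [2.0, 4.0], "b": [20.0, 50.0]})
train_scaled, test_scaled = scale_train_test(x_train, x_test)
```

Because the scaler never sees the testing rows during fit, the testing columns will generally not have exactly zero mean or unit variance after the transform; that is expected and avoids leaking test information into preprocessing.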
- impute_cv_data(x_train, x_test)[source]
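Imputes missing values in the training and testing datasets, beginning with simple ‘mode’ imputation of categorical variables; quantitative features are imputed afterwards.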
- Parameters:
x_train – pandas dataframe with train set data
x_test – pandas dataframe with test set data
Returns: Imputed x_train and x_test
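A rough sketch of the two-stage imputation (mode for categorical features, then iterative imputation for quantitative features) is shown below. It is written under several assumptions that the documentation above does not confirm: the helper name impute_train_test and its categorical_cols argument are hypothetical, modes are taken from the training set only, and scikit-learn's IterativeImputer stands in for the MICE-based step.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer


def impute_train_test(x_train, x_test, categorical_cols, random_state=None):
    """Hypothetical helper: mode-impute categorical columns, then iteratively
    impute the remaining (quantitative) columns, fitting on x_train only."""
    x_train, x_test = x_train.copy(), x_test.copy()

    # Stage 1: simple mode imputation, using the training-set mode for both sets.
    for col in categorical_cols:
        mode_value = x_train[col].mode().iloc[0]  # mode() ignores NaN by default
        x_train[col] = x_train[col].fillna(mode_value)
        x_test[col] = x_test[col].fillna(mode_value)

    # Stage 2: iterative imputation for quantitative columns (assumed stand-in
    # for the MICE-based step), fit on training data and applied to both sets.
    quantitative_cols = [c for c in x_train.columns if c not in categorical_cols]
    if quantitative_cols:
        imputer = IterativeImputer(random_state=random_state, max_iter=10)
        x_train[quantitative_cols] = imputer.fit_transform(x_train[quantitative_cols])
        x_test[quantitative_cols] = imputer.transform(x_test[quantitative_cols])

    return x_train, x_test


# Toy data with missing values, for illustration only.
x_train = pd.DataFrame(
    {"cat": [0, 1, 1, np.nan], "q1": [1.0, 2.0, np.nan, 4.0], "q2": [2.0, 4.0, 6.0, 8.0]}
)
x_test = pd.DataFrame({"cat": [np.nan, 0], "q1": [np.nan, 3.0], "q2": [5.0, 6.0]})
train_imputed, test_imputed = impute_train_test(
    x_train, x_test, categorical_cols=["cat"], random_state=42
)
```

Passing random_state makes the iterative step repeatable, which mirrors the random_state parameter accepted by the class constructor.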
- load_data()[source]
Load the target training and testing datasets and return respective dataframes, feature header labels, dataset name, and specific cv partition number for this dataset pair.
- run()[source]
Run all elements of the data preprocessing: data scaling and missing value imputation (mode imputation for categorical features and MICE-based iterative imputing for quantitative features)
- write_cv_files(data_train, data_test)[source]
Exports new training and testing datasets following imputation and/or scaling. Includes an option to overwrite the original datasets (to save space) or to preserve a copy of the original training and testing datasets marked with CVOnly (for comparison and quality control).
- Parameters:
data_train – pandas dataframe with train set data
data_test – pandas dataframe with test set data
Returns: None
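The overwrite-or-preserve behaviour described for write_cv_files could look roughly like the sketch below. Everything here is an assumption made for illustration: the helper name write_cv_pair, the file names, and the placement of the CVOnly marker in the preserved file name are hypothetical and may differ from what the library actually writes.

```python
import os
import tempfile

import pandas as pd


def write_cv_pair(data_train, data_test, train_path, test_path, overwrite_cv=True):
    """Hypothetical helper: write processed CV files, optionally keeping the
    originals under a name marked with CVOnly."""
    if not overwrite_cv:
        for path in (train_path, test_path):
            if os.path.exists(path):
                root, ext = os.path.splitext(path)
                # Assumed naming scheme: append "_CVOnly" before the extension.
                os.replace(path, root + "_CVOnly" + ext)
    data_train.to_csv(train_path, index=False)
    data_test.to_csv(test_path, index=False)


# Demonstration in a temporary directory with made-up file names.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "demo_CV_0_Train.csv")
test_path = os.path.join(tmp_dir, "demo_CV_0_Test.csv")

# Original (unprocessed) CV partitions.
pd.DataFrame({"a": [1, 2]}).to_csv(train_path, index=False)
pd.DataFrame({"a": [3]}).to_csv(test_path, index=False)

# Processed data replaces the originals; the originals are kept as *_CVOnly.csv.
write_cv_pair(
    pd.DataFrame({"a": [0.5, 1.5]}),
    pd.DataFrame({"a": [2.5]}),
    train_path,
    test_path,
    overwrite_cv=False,
)
```

With overwrite_cv=True the same call would simply write over the existing files and leave no CVOnly copies behind.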