streamline.dataprep.exploratory_analysis module

class streamline.dataprep.exploratory_analysis.EDAJob(dataset, experiment_path, ignore_features=None, categorical_features=None, explorations=None, plots=None, categorical_cutoff=10, sig_cutoff=0.05, random_state=None)[source]

Bases: Job

Exploratory Data Analysis Class for the EDA/Phase 1 step of STREAMLINE

Initialization function for Exploratory Data Analysis Class. Parameters are defined below.

Parameters:

dataset – a streamline.utils.dataset.Dataset object or a path to a dataset text file
experiment_path – path to the experiment logging directory
ignore_features – list of column names of features to ignore
categorical_features – list of column names of features to treat as categorical
explorations – list of names of analyses to run during EDA (must be in set X)
plots – list of analysis plots to save in the experiment directory (must be in set Y)
categorical_cutoff – number of unique values used as the cutoff for treating a feature as categorical, default=10
sig_cutoff – significance cutoff for continuous variables, default=0.05
random_state – random state to set seeds for reproducibility of algorithms

counts_summary(total_missing=None, plot=False, show=False)[source]

Reports dataset counts: the number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if the user requests it.

Parameters:

total_missing – total number of missing values (optional; recomputed if not given)
plot – flag to save the bar graph in the experiment log folder
show – flag to display the bar graph in an interactive interface

Returns:

describe_data()[source]: Conduct and export basic dataset descriptions including basic column statistics, column variable types (i.e. int64 vs. float64), and unique value counts for each column

drop_ignored_rowcols()[source]: Basic data cleaning: Drops any instances with a missing outcome value as well as any features (ignore_features) specified by user

feature_correlation_plot(x_data=None, show=False)[source]

Calculates pairwise feature correlations using the Pearson correlation coefficient and exports a heatmap visualization. Because of the computational expense, this is not recommended for datasets with a large number of instances and/or features unless needed. The generated heatmap is also difficult to read when the target dataset has a large number of features.

Parameters:

x_data – data with only feature columns
show – flag to show plot or not
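To make the cost noted above concrete, the following sketch computes a Pearson correlation matrix by hand. It is an illustration under assumptions, not the library's code: features are held in a plain dict of equal-length numeric lists, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations: one entry per feature pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Illustrative feature columns (hypothetical values).
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],
    "f3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(features)
```

The number of entries grows with the square of the number of features, and each entry scans every instance, which is why both the computation and the heatmap become unwieldy on wide datasets.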

graph_selector(feature_name)[source]

Assuming a categorical class outcome, a barplot is generated given a categorical feature, and a boxplot is generated given a quantitative feature.

Parameters:

feature_name – name of the feature column the function operates on

identify_feature_types(x_data=None)[source]: Automatically identify categorical vs. quantitative features/variables Takes a dataframe (of independent variables) with column labels and returns a list of column names identified as being categorical based on user defined cutoff (categorical_cutoff).

join()[source]

make_log_folders()[source]: Makes folders for logging exploratory data analysis

missing_count_plot(plot=False)[source]: Plots a histogram of missingness across all data columns.

missingness_counts()[source]: Count and export missing values for all data columns.

run(top_features=20)[source]

Wrapper around run_explore.

Parameters:

top_features – number of top features to consider (default=20)

run_explore(top_features=20)[source]

Runs exploratory data analysis as configured on the EDA object.

Parameters:

top_features – number of top features to consider (default=20)

save_runtime()[source]: Export runtime for this phase of the pipeline on current target dataset

start(top_features=20)[source]

test_selector(feature_name)[source]

Selects and applies the appropriate univariate association test for a given feature and returns the resulting p-value.

Parameters:

feature_name – name of the feature column the test is run on

univariate_analysis(top_features=20)[source]

Calculates the significance of the univariate association between each individual feature and the class outcome. Assumes a categorical outcome, using the Chi-square test for categorical features and the Mann-Whitney test for quantitative features.

Parameters:

top_features – number of top features to show/consider
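For readers unfamiliar with the two tests named above, the sketch below computes their test statistics with the standard library only. It is not the STREAMLINE implementation: it stops at the statistic and does not derive a p-value, the chi-square version applies no continuity correction, and the function names and sample data are hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature, outcome):
    """Pearson chi-square statistic for a categorical feature vs. class outcome."""
    n = len(feature)
    observed = Counter(zip(feature, outcome))  # contingency table cell counts
    row_totals = Counter(feature)
    col_totals = Counter(outcome)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U statistic for group_a: pairs won, with ties counting half."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Categorical feature perfectly aligned with a binary class (illustrative).
chi2 = chi_square_statistic(["x", "x", "y", "y"], [0, 0, 1, 1])

# Quantitative feature values split by class (illustrative).
u_stat = mann_whitney_u([1, 2, 3], [2, 4])
```

In practice a statistics library would turn each statistic into a p-value; the pairwise loop here is quadratic and intended only to show what is being measured.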

univariate_plots(sorted_p_list=None, top_features=20)[source]

Checks whether the p-value of each feature is less than or equal to the significance cutoff (sig_cutoff). If so, calls graph_selector to generate an appropriate plot.

Parameters:

sorted_p_list – sorted list of p-values
top_features – number of top features to consider (default=20)
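The selection rule can be sketched as a simple filter. The structure of `sorted_p_list` is not specified above, so this example assumes a list of `(feature_name, p_value)` pairs sorted by ascending p-value. The function name `features_to_plot` is hypothetical, and plotting itself is left out.

```python
def features_to_plot(sorted_p_list, sig_cutoff=0.05, top_features=20):
    """Return names of features whose p-value is at or below the cutoff.

    Assumes `sorted_p_list` holds (feature_name, p_value) pairs in
    ascending p-value order; only the first `top_features` are examined.
    """
    selected = []
    for name, p_value in sorted_p_list[:top_features]:
        if p_value <= sig_cutoff:
            selected.append(name)
    return selected


# Illustrative p-values (hypothetical).
p_values = [("a", 0.001), ("b", 0.04), ("c", 0.05), ("d", 0.2)]
to_plot = features_to_plot(p_values, sig_cutoff=0.05, top_features=3)
```

Each selected name would then be passed to a plotting routine such as graph_selector; note that a p-value exactly equal to the cutoff is included.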