streamline.dataprep.exploratory_analysis module

class streamline.dataprep.exploratory_analysis.EDAJob(dataset, experiment_path, ignore_features=None, categorical_features=None, explorations=None, plots=None, categorical_cutoff=10, sig_cutoff=0.05, random_state=None)[source]

Bases: Job

Exploratory Data Analysis Class for the EDA/Phase 1 step of STREAMLINE

Initialization function for Exploratory Data Analysis Class. Parameters are defined below.

Parameters:
  • dataset – a streamline.utils.dataset.Dataset object or a path to a dataset text file

  • experiment_path – path to the experiment logging directory

  • ignore_features – list of column names of features to ignore

  • categorical_features – list of column names of categorical features

  • explorations – list of names of analyses to run during EDA (must be in set X)

  • plots – list of analysis plots to save in experiment directory (must be in set Y)

  • categorical_cutoff – cutoff on the number of unique values used to decide whether a feature is treated as categorical during analysis, default=10

  • sig_cutoff – significance cutoff for continuous variables, default=0.05

  • random_state – random state to set seeds for reproducibility of algorithms

counts_summary(total_missing=None, plot=False, show=False)[source]

Reports dataset counts: number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if the user requests it.

Parameters:
  • total_missing – total number of missing values (optional; recomputed if not given)

  • plot – flag to output bar graph in the experiment log folder

  • show – flag to output the bar graph in interactive interface

Returns:

describe_data()[source]

Conducts and exports basic dataset descriptions, including basic column statistics, column variable types (e.g. int64 vs. float64), and unique value counts for each column.

drop_ignored_rowcols()[source]

Basic data cleaning: drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user.
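
The cleaning step described above can be illustrated with a minimal, standard-library sketch. This is not the STREAMLINE implementation: the function name drop_ignored, the outcome_label argument, and the use of None to stand in for a missing value are all assumptions made for the example.

```python
def drop_ignored(rows, outcome_label, ignore_features=None):
    """Drop rows whose outcome is missing, then drop the ignored feature columns.

    Hypothetical helper for illustration; `rows` is a list of dicts.
    """
    ignored = set(ignore_features or [])
    cleaned = []
    for row in rows:
        # None stands in for a missing value in this sketch.
        if row.get(outcome_label) is None:
            continue
        cleaned.append({k: v for k, v in row.items() if k not in ignored})
    return cleaned


rows = [
    {"id": 1, "age": 34, "Class": 1},
    {"id": 2, "age": 51, "Class": None},  # missing outcome: row is dropped
    {"id": 3, "age": None, "Class": 0},   # missing feature value: row is kept
]
cleaned = drop_ignored(rows, outcome_label="Class", ignore_features=["id"])
```

Note that only a missing outcome removes a row; missing feature values are left in place.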

feature_correlation_plot(x_data=None, show=False)[source]

Calculates pairwise feature correlations using Pearson correlation and exports the corresponding heatmap visualization. Because of its computational expense, this is not recommended for datasets with a large number of instances and/or features unless needed. The generated heatmap is also difficult to read when the target dataset has a large number of features.

Parameters:
  • x_data – data with only feature columns

  • show – flag to show plot or not
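
To make the cost concrete, here is a minimal sketch of a pairwise Pearson correlation matrix using only the standard library. The helper names pearson and correlation_matrix are hypothetical, and the input is a plain dict of columns rather than the dataframe the method receives.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every feature with every other feature (p * p pairs)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


features = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
    "score": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(features)
```

With p features the matrix has p × p cells, each requiring a pass over all instances, which is why both the run time and the heatmap's readability degrade as the feature count grows.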

graph_selector(feature_name)[source]

Assuming a categorical class outcome, a barplot is generated given a categorical feature, and a boxplot is generated given a quantitative feature.

Parameters:

feature_name – name of the feature column to plot

identify_feature_types(x_data=None)[source]

Automatically identifies categorical vs. quantitative features/variables. Takes a dataframe (of independent variables) with column labels and returns a list of the column names identified as categorical based on the user-defined cutoff (categorical_cutoff).
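
A cutoff-based rule of this kind can be sketched as follows. This is an illustration under assumptions, not the library's code: the function name identify_categorical is hypothetical, columns are given as a plain dict, None marks a missing value, and the comparison is assumed to be "at most categorical_cutoff unique values".

```python
def identify_categorical(columns, categorical_cutoff=10):
    """Return the names of columns with at most `categorical_cutoff` unique values."""
    categorical = []
    for name, values in columns.items():
        # Ignore missing entries (None in this sketch) when counting unique values.
        observed = [v for v in values if v is not None]
        if len(set(observed)) <= categorical_cutoff:
            categorical.append(name)
    return categorical


columns = {
    "sex": [0, 1, 1, 0, 1, 0],
    "smoker": ["y", "n", "n", "y", None, "n"],
    "bmi": [21.4, 30.2, 27.9, 24.1, 33.0, 19.8],
}
categorical = identify_categorical(columns, categorical_cutoff=3)
```

Here sex and smoker each have two distinct values and are flagged as categorical, while bmi has six and is treated as quantitative.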

join()[source]

make_log_folders()[source]

Makes folders for logging exploratory data analysis

missing_count_plot(plot=False)[source]

Plots a histogram of missingness across all data columns.

missingness_counts()[source]

Count and export missing values for all data columns.
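
As a rough illustration of the counting step (the export is omitted), the sketch below tallies missing values per column. The name missing_counts and the use of None as the missing-value marker are assumptions for the example.

```python
def missing_counts(columns):
    """Count missing entries (None in this sketch) in each column."""
    return {name: sum(v is None for v in values) for name, values in columns.items()}


columns = {
    "age": [34, None, 51, None],
    "bmi": [21.4, 30.2, None, 24.1],
    "Class": [1, 0, 0, 1],
}
counts = missing_counts(columns)
# A dataset-wide total, comparable in spirit to the total_missing
# argument that counts_summary accepts.
total_missing = sum(counts.values())
```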

run(top_features=20)[source]

Wrapper function for run_explore.

Parameters:

top_features – number of top features to consider (default=20)

run_explore(top_features=20)[source]

Runs exploratory data analysis as configured on the EDA object.

Parameters:

top_features – number of top features to consider (default=20)

save_runtime()[source]

Exports the runtime for this phase of the pipeline on the current target dataset.

start(top_features=20)[source]

test_selector(feature_name)[source]

Selects and applies the appropriate univariate association test for a given feature. Returns the resulting p-value.

Parameters:

feature_name – name of the feature column to test

univariate_analysis(top_features=20)[source]

Calculates the significance of the univariate association between each individual feature and the class outcome. Assumes a categorical outcome, using a chi-square test for categorical features and a Mann-Whitney test for quantitative features.

Parameters:

top_features – number of top features to show/consider
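
For the categorical branch, the chi-square statistic can be sketched from a feature/outcome contingency table as below. This is a standard-library illustration only: the function name chi_square_statistic is hypothetical, no continuity correction is applied, and converting the statistic to a p-value is left to a statistics library. Whether STREAMLINE applies a correction is not stated in this reference.

```python
from collections import Counter


def chi_square_statistic(feature_values, outcome_values):
    """Pearson chi-square statistic and degrees of freedom for two categorical sequences."""
    observed = Counter(zip(feature_values, outcome_values))
    row_totals = Counter(feature_values)
    col_totals = Counter(outcome_values)
    n = len(feature_values)
    statistic = 0.0
    for f in row_totals:
        for c in col_totals:
            # Expected count under independence of feature and outcome.
            expected = row_totals[f] * col_totals[c] / n
            statistic += (observed[(f, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


feature = ["a"] * 10 + ["b"] * 10
outcome = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
statistic, dof = chi_square_statistic(feature, outcome)
```

In this toy table every expected count is 5 and every observed count differs from it by 3, so the statistic is 4 × 9 / 5 = 7.2 with one degree of freedom.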

univariate_plots(sorted_p_list=None, top_features=20)[source]

Checks whether p-value of each feature is less than or equal to significance cutoff. If so, calls graph_selector to generate an appropriate plot.

Parameters:
  • sorted_p_list – sorted list of p-values

  • top_features – number of top features to consider (default=20)
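
The selection logic described for univariate_plots might look roughly like the sketch below. Several details are assumptions rather than documented behavior: the name features_to_plot, the shape of sorted_p_list (a list of (feature, p-value) pairs in ascending p-value order), and the choice to take the top features first and then apply the cutoff.

```python
def features_to_plot(sorted_p_list, sig_cutoff=0.05, top_features=20):
    """Return names of the top features whose p-value is <= sig_cutoff."""
    selected = []
    # Assumption: limit to the first `top_features` entries, then filter by cutoff.
    for name, p_value in sorted_p_list[:top_features]:
        if p_value <= sig_cutoff:
            selected.append(name)
    return selected


p_values = {"bmi": 0.20, "age": 0.001, "sex": 0.05, "height": 0.03}
sorted_p_list = sorted(p_values.items(), key=lambda item: item[1])
selected = features_to_plot(sorted_p_list, sig_cutoff=0.05, top_features=3)
```

Each selected name would then be passed to a plotting routine such as graph_selector; the comparison is inclusive, so a feature with a p-value exactly equal to the cutoff is still plotted.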