streamline.dataprep.exploratory_analysis module
- class streamline.dataprep.exploratory_analysis.EDAJob(dataset, experiment_path, ignore_features=None, categorical_features=None, explorations=None, plots=None, categorical_cutoff=10, sig_cutoff=0.05, random_state=None)[source]
Bases: Job
Exploratory Data Analysis Class for the EDA/Phase 1 step of STREAMLINE
Initialization function for Exploratory Data Analysis Class. Parameters are defined below.
- Parameters:
dataset – a streamline.utils.dataset.Dataset object or a path to a dataset text file
experiment_path – path to the experiment logging directory
ignore_features – list of column names of features to ignore
categorical_features – list of column names of categorical features
explorations – list of names of the analyses to run during EDA (must be in set X)
plots – list of analysis plots to save in the experiment directory (must be in set Y)
categorical_cutoff – unique-value cutoff used to decide automatically whether a feature is treated as categorical, default=10
sig_cutoff – significance cutoff for continuous variables, default=0.05
random_state – random state to set seeds for reproducibility of algorithms
- counts_summary(total_missing=None, plot=False, show=False)[source]
Reports dataset counts: number of instances, total features, categorical features, quantitative features, and class counts. Also saves a simple bar graph of class counts if the user requests it.
- Parameters:
total_missing – total number of missing values (optional, recomputed if not given)
plot – flag to output bar graph in the experiment log folder
show – flag to output the bar graph in interactive interface
- describe_data()[source]
Conducts and exports basic dataset descriptions, including basic column statistics, column variable types (e.g. int64 vs. float64), and unique value counts for each column.
- drop_ignored_rowcols()[source]
Basic data cleaning: drops any instances with a missing outcome value, as well as any features (ignore_features) specified by the user.
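The cleaning step described above can be sketched in plain Python. This is an illustration only, not the STREAMLINE implementation: the function name `drop_missing_outcome_and_ignored`, the `outcome_label` argument, and the list-of-dicts data layout are all assumptions made for the example.

```python
def drop_missing_outcome_and_ignored(rows, outcome_label, ignore_features=None):
    """Hypothetical helper: remove rows with no outcome and drop ignored columns."""
    ignored = set(ignore_features or [])
    cleaned = []
    for row in rows:
        # Skip instances whose outcome value is missing.
        if row.get(outcome_label) is None:
            continue
        # Keep every column except the ones the user asked to ignore.
        cleaned.append({k: v for k, v in row.items() if k not in ignored})
    return cleaned


rows = [
    {"id": 1, "age": 34, "Class": 1},
    {"id": 2, "age": 51, "Class": None},  # missing outcome, will be dropped
    {"id": 3, "age": 47, "Class": 0},
]
cleaned = drop_missing_outcome_and_ignored(
    rows, outcome_label="Class", ignore_features=["id"]
)
```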
- feature_correlation_plot(x_data=None, show=False)[source]
Calculates pairwise feature correlations using the Pearson correlation coefficient and exports a corresponding heatmap visualization. Because of the computational expense, this is not recommended for datasets with a large number of instances and/or features unless needed. The generated heatmap is also difficult to read when the target dataset has a large number of features.
- Parameters:
x_data – data with only feature columns
show – flag to show plot or not
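To make the cost of this step concrete, here is a minimal standard-library sketch of a pairwise Pearson correlation matrix. It is not the library's code; the helper names `pearson` and `correlation_matrix` and the dict-of-lists layout are assumptions for illustration. Note that the number of coefficients grows with the square of the number of features.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Hypothetical helper: one coefficient for every ordered pair of features."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


columns = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with height
    "score": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with height
}
matrix = correlation_matrix(columns)
```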
- graph_selector(feature_name)[source]
Assuming a categorical class outcome, a barplot is generated given a categorical feature, and a boxplot is generated given a quantitative feature.
- Parameters:
feature_name – name of the feature column to plot
- identify_feature_types(x_data=None)[source]
Automatically identifies categorical vs. quantitative features/variables. Takes a dataframe (of independent variables) with column labels and returns a list of the column names identified as categorical, based on the user-defined cutoff (categorical_cutoff).
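A cutoff-based split of this kind can be sketched as follows. This is an assumption-laden illustration rather than the library's logic: the helper name `identify_categorical` and the dict-of-lists input are hypothetical, and treating a feature as categorical when its unique-value count is at or below the cutoff is an assumed boundary.

```python
def identify_categorical(columns, categorical_cutoff=10):
    """Hypothetical helper: names of columns with few distinct values."""
    categorical = []
    for name, values in columns.items():
        # Count distinct non-missing values in the column.
        distinct = {v for v in values if v is not None}
        # Assumed rule: at or below the cutoff means categorical.
        if len(distinct) <= categorical_cutoff:
            categorical.append(name)
    return categorical


columns = {
    "sex": [0, 1, 1, 0, 1, 0],
    "age": [23, 35, 41, 52, 67, 29],
}
categorical = identify_categorical(columns, categorical_cutoff=3)
```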
- run(top_features=20)[source]
Wrapper function for run_explore.
- Parameters:
top_features – number of top features to consider (default=20)
- run_explore(top_features=20)[source]
Runs exploratory data analysis as configured on the EDA object.
- Parameters:
top_features – number of top features to consider (default=20)
- test_selector(feature_name)[source]
Selects and applies the appropriate univariate association test for a given feature. Returns the resulting p-value.
- Parameters:
feature_name – name of the feature column to test
- univariate_analysis(top_features=20)[source]
Calculates the significance of the univariate association between each individual feature and the class outcome. Assumes a categorical outcome, using the Chi-square test for categorical features and the Mann-Whitney test for quantitative features.
- Parameters:
top_features – number of top features to show/consider
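The choice between the two tests can be illustrated with a standard-library sketch. Everything here is an assumption for the example and not the STREAMLINE code: the function names are hypothetical, the outcome is assumed binary, the chi-square test omits any continuity correction, and the Mann-Whitney p-value uses a two-sided normal approximation without tie or continuity correction, so values may differ from a statistics library.

```python
import math
from collections import Counter


def chi2_sf(stat, dof):
    """Upper tail of the chi-square distribution (series for the incomplete gamma)."""
    a, z = dof / 2.0, stat / 2.0
    if z <= 0.0:
        return 1.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi_square_p(feature, outcome):
    """Pearson chi-square test of independence, no continuity correction."""
    n = len(feature)
    rows, cols = Counter(feature), Counter(outcome)
    observed = Counter(zip(feature, outcome))
    dof = (len(rows) - 1) * (len(cols) - 1)
    if dof == 0:
        return 1.0  # a constant column carries no association
    stat = 0.0
    for r, r_total in rows.items():
        for c, c_total in cols.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return chi2_sf(stat, dof)


def mann_whitney_p(group_a, group_b):
    """Two-sided Mann-Whitney U test via the normal approximation."""
    n1, n2 = len(group_a), len(group_b)
    pooled = sorted(list(group_a) + list(group_b))
    rank_of, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # average rank for tied values
        i = j
    u1 = sum(rank_of[v] for v in group_a) - n1 * (n1 + 1) / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = abs(u1 - n1 * n2 / 2.0) / sd_u
    return math.erfc(z / math.sqrt(2.0))


def select_test(values, outcome, is_categorical):
    """Hypothetical dispatcher: chi-square if categorical, else Mann-Whitney."""
    if is_categorical:
        return chi_square_p(values, outcome)
    first, second = sorted(set(outcome))  # assumes exactly two classes
    a = [v for v, y in zip(values, outcome) if y == first]
    b = [v for v, y in zip(values, outcome) if y == second]
    return mann_whitney_p(a, b)


outcome = [0, 0, 0, 1, 1, 1]
p_quantitative = select_test([1, 2, 3, 4, 5, 6], outcome, is_categorical=False)
p_categorical = select_test(["a", "b", "a", "b", "a", "b"], outcome, is_categorical=True)
```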
- univariate_plots(sorted_p_list=None, top_features=20)[source]
Checks whether p-value of each feature is less than or equal to significance cutoff. If so, calls graph_selector to generate an appropriate plot.
- Parameters:
sorted_p_list – sorted list of p-values
top_features – number of top features to consider (default=20)
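The filtering rule described for univariate_plots can be sketched as below. The shape of `sorted_p_list` (a list of `(feature_name, p_value)` pairs in ascending p-value order), the helper name `features_to_plot`, and applying `top_features` before the significance check are all assumptions for illustration; the actual method may differ.

```python
def features_to_plot(sorted_p_list, sig_cutoff=0.05, top_features=20):
    """Hypothetical helper: names of top-ranked features that pass the cutoff."""
    # Only the first top_features entries are considered, and a feature
    # qualifies when its p-value is less than or equal to sig_cutoff.
    return [name for name, p in sorted_p_list[:top_features] if p <= sig_cutoff]


sorted_p_list = [("bmi", 0.001), ("age", 0.03), ("sex", 0.05), ("height", 0.2)]
selected = features_to_plot(sorted_p_list)
selected_top2 = features_to_plot(sorted_p_list, top_features=2)
```

In the documented workflow, each selected feature name would then be passed to graph_selector to produce the matching bar plot or box plot.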