API Reference¶

Foreshadow¶

Core end-to-end pipeline, foreshadow.

class Foreshadow(X_preparer=None, y_preparer=None, estimator=None, optimizer=None, optimizer_kwargs=None)[source]¶

An end-to-end pipeline to preprocess and tune a machine learning model.

Example

>>> shadow = Foreshadow()

Parameters:

X_preparer (Preprocessor, optional) – Preprocessor instance that will apply to X data. Passing False prevents the automatic generation of an instance.
y_preparer (Preprocessor, optional) – Preprocessor instance that will apply to y data. Passing False prevents the automatic generation of an instance.
estimator (sklearn.base.BaseEstimator, optional) – Estimator instance to fit on processed data
optimizer (sklearn.grid_search.BaseSearchCV, optional) – Optimizer class to optimize feature engineering and model hyperparameters

X_preparer¶

Preprocessor object for performing feature engineering on X data.

Getter:	Returns Preprocessor object
Setter:	Verifies the Preprocessor object; if None, creates a default Preprocessor
Type:	`Preprocessor`

y_preparer¶

Preprocessor object for performing scaling and encoding on Y data.

Getter:	Returns Preprocessor object
Setter:	Verifies the Preprocessor object; if None, creates a default Preprocessor
Type:	`Preprocessor`

estimator¶

Estimator object for fitting preprocessed data.

Getter:	Returns Estimator object
Setter:	Verifies Estimator object. If None, an `AutoEstimator` object is created in place.
Type:	`sklearn.base.BaseEstimator`

optimizer¶

Optimizer class that will fit the model.

Performs a grid or random search algorithm on the parameter space from the preprocessors and estimators in the pipeline

Getter:	Returns optimizer class
Setter:	Verifies Optimizer class, defaults to None

fit(data_df, y_df)[source]¶

Fit the Foreshadow instance using the provided input data.

Parameters:

data_df (`DataFrame`) – The input feature(s)
y_df (`DataFrame`) – The response feature(s)
Returns:	The fitted instance.
Return type:	`Foreshadow`

predict(data_df)[source]¶

Use the trained estimator to predict the response variable.

Parameters:	data_df (`DataFrame`) – The input feature(s)
Returns:	The response feature(s) (transformed if necessary)
Return type:	`DataFrame`

predict_proba(data_df)[source]¶

Use the trained estimator to predict the response variable.

Uses the predicted confidences instead of binary predictions.

Parameters:	data_df (`DataFrame`) – The input feature(s)
Returns:	The probability associated with each response feature
Return type:	`DataFrame`

score(data_df, y_df=None, sample_weight=None)[source]¶

Use the trained estimator to compute the evaluation score.

The scoring method is defined by the selected estimator.

Parameters:

data_df (`DataFrame`) – The input feature(s)
y_df (`DataFrame`, optional) – The response feature(s)
sample_weight (`numpy.ndarray`, optional) – The weights to be used when scoring each sample
Returns:	A computed prediction fitness score
Return type:	float

dict_serialize(deep=False)[source]¶

Serialize the init parameters of the foreshadow object.

Parameters:	deep (bool) – If True, will return the parameters for this estimator recursively
Returns:	The initialization parameters of the foreshadow object.
Return type:	dict

classmethod dict_deserialize(data)[source]¶

Deserialize the dictionary form of a foreshadow object.

Parameters:	data – The dictionary to parse when constructing the foreshadow object.
Returns:	A re-constructed foreshadow object.
Return type:	object

get_params(deep=True)[source]¶

Get params for this object. See super.

Parameters:	deep – True to recursively call get_params, False to not.
Returns:	params for this object.

set_params(**params)[source]¶

Set params for this object. See super.

Parameters:	**params – params to set.
Returns:	See super.
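
Both methods defer to the parent class ("See super"), which for scikit-learn style estimators means nested parameters are addressed with a double underscore. The following is a minimal standard-library sketch of that convention, not foreshadow's own code; `ParamsMixin`, `Scaler`, and `Model` are hypothetical names introduced only for illustration.

```python
class ParamsMixin:
    """Hypothetical stand-in for the scikit-learn parameter convention."""

    _param_names = ()

    def get_params(self, deep=True):
        params = {}
        for name in self._param_names:
            value = getattr(self, name)
            params[name] = value
            # With deep=True, nested objects contribute "name__subname" keys.
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params(deep=True).items():
                    params[f"{name}__{sub_name}"] = sub_value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_key = key.partition("__")
            if name not in self._param_names:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                # Delegate "name__subname" to the nested object.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


class Scaler(ParamsMixin):
    _param_names = ("with_mean",)

    def __init__(self, with_mean=True):
        self.with_mean = with_mean


class Model(ParamsMixin):
    _param_names = ("scaler", "alpha")

    def __init__(self, scaler=None, alpha=1.0):
        self.scaler = scaler if scaler is not None else Scaler()
        self.alpha = alpha


model = Model()
model.set_params(alpha=0.5, scaler__with_mean=False)
shallow = model.get_params(deep=False)  # only the top-level names
deep = model.get_params(deep=True)      # also includes "scaler__with_mean"
```

With `deep=False` only the object's own parameters are returned; with `deep=True` the nested keys appear as well, which is what lets a search routine address sub-objects by name.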

dp¶

Intents¶

Intents package used by IntentMapper PreparerStep.

class Categoric[source]¶

Defines a categoric column type.

confidence_computation = {<foreshadow.metrics.MetricWrapper 'num_valid'>: 0.25, <foreshadow.metrics.MetricWrapper 'unique_heur'>: 0.65, <foreshadow.metrics.MetricWrapper 'is_numeric'>: 0.1}¶

fit(X, y=None, **fit_params)[source]¶

Empty fit.

Parameters:

X – The input data
y – The response variable
**fit_params – Additional parameters for the fit
Returns:	self

transform(X, y=None)[source]¶

Pass-through transform.

Parameters:

X – The input data
y – The response variable
Returns:	The input column

classmethod column_summary(df)[source]¶

class Numeric[source]¶

Defines a numeric column type.

confidence_computation = {<foreshadow.metrics.MetricWrapper 'num_valid'>: 0.3, <foreshadow.metrics.MetricWrapper 'unique_heur'>: 0.2, <foreshadow.metrics.MetricWrapper 'is_numeric'>: 0.4, <foreshadow.metrics.MetricWrapper 'is_string'>: 0.1}¶

fit(X, y=None, **fit_params)[source]¶

Empty fit.

Parameters:

X – The input data
y – The response variable
**fit_params – Additional parameters for the fit
Returns:	self

transform(X, y=None)[source]¶

Convert a column to a numeric form.

Parameters:

X – The input data
y – The response variable
Returns:	A column with all rows converted to numbers.

classmethod column_summary(df)[source]¶

class Text[source]¶

Defines a text column type.

confidence_computation = {<foreshadow.metrics.MetricWrapper 'num_valid'>: 0.2, <foreshadow.metrics.MetricWrapper 'unique_heur'>: 0.2, <foreshadow.metrics.MetricWrapper 'is_numeric'>: 0.2, <foreshadow.metrics.MetricWrapper 'is_string'>: 0.2, <foreshadow.metrics.MetricWrapper 'has_long_text'>: 0.2}¶

fit(X, y=None, **fit_params)[source]¶

Empty fit.

Parameters:

X – The input data
y – The response variable
**fit_params – Additional parameters for the fit
Returns:	self

transform(X, y=None)[source]¶

Convert a column to a text form.

Parameters:

X – The input data
y – The response variable
Returns:	A column with all rows converted to text.

classmethod column_summary(df)[source]¶

class BaseIntent[source]¶

Base for all intent definitions.

For each intent subclass a class attribute called confidence_computation must be defined which is of the form:

{
     metric_def: weight
}

classmethod get_confidence(X, y=None)[source]¶

Determine the confidence for an intent match.

Parameters:

X – input DataFrame.
y – response variable
Returns:	A confidence value bounded between 0.0 and 1.0
Return type:	float
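
One plausible reading of the `{metric_def: weight}` form is a weighted sum of metric scores, clamped to the range 0.0 to 1.0. The sketch below shows that idea with plain Python; it is an assumption about how the weights could be combined, not a copy of foreshadow's implementation, and `weighted_confidence`, `fraction_valid`, and `fraction_numeric` are hypothetical helpers.

```python
def weighted_confidence(column, confidence_computation):
    """Combine metric scores in [0, 1] into one confidence in [0, 1]."""
    total_weight = sum(confidence_computation.values())
    score = sum(
        metric(column) * weight
        for metric, weight in confidence_computation.items()
    )
    # Normalise by the total weight and clamp, so the result stays bounded
    # even if the weights do not sum exactly to 1.
    return min(1.0, max(0.0, score / total_weight))


# Hypothetical metrics: each returns a fraction between 0 and 1.
def fraction_valid(column):
    return sum(v is not None for v in column) / len(column)


def fraction_numeric(column):
    return sum(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
    ) / len(column)


# Hypothetical weights in the {metric_def: weight} form.
numeric_like = {fraction_valid: 0.4, fraction_numeric: 0.6}
confidence = weighted_confidence([1, 2.5, None, "x"], numeric_like)
```

Here three of four values are valid and two of four are numeric, so the confidence comes out close to 0.6 (0.75 × 0.4 + 0.5 × 0.6).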

classmethod column_summary(df)[source]¶

Transformers¶

Internal Transformers¶

Smart Transformers¶

Transformer Bases¶

Estimators¶

Estimators provided by foreshadow.

class AutoEstimator(problem_type=None, auto=None, include_preprocessors=False, estimator_kwargs=None)[source]¶

A wrapped estimator that selects the solution for a given problem.

By default, each automatic machine learning solution runs for 1 minute; this can be changed through the passed kwargs. Autosklearn is not required, but if it is installed it can be used alongside TPOT.

Parameters:

problem_type (str) – The problem type, ‘regression’ or ‘classification’
auto (str) – The automatic estimator, ‘tpot’ or ‘autosklearn’
include_preprocessors (bool) – Whether to include preprocessors in AutoML pipelines
estimator_kwargs (dict) – A dictionary of args to pass to the specified auto estimator (both problem_type and auto must be specified)

problem_type¶

Type of machine learning problem.

Either regression or classification.

Returns:	self._problem_type

auto¶

Type of automl package.

Either tpot or autosklearn.

Returns:	self._auto, the type of automl package

estimator_kwargs¶

Get dictionary of kwargs to pass to AutoML package.

Returns:	estimator kwargs

configure_estimator(y)[source]¶

Construct and return the auto estimator instance.

Parameters:	y – input labels
Returns:	autoestimator instance

fit(X, y)[source]¶

Fit the AutoEstimator instance.

Uses the selected AutoML estimator.

Parameters:

X (pandas.DataFrame or numpy.ndarray or list) – The input feature(s)
y (pandas.DataFrame or numpy.ndarray or list) – The response feature(s)
Returns:	The selected estimator

predict(X)[source]¶

Use the trained estimator to predict the response.

Parameters:	X (pandas.DataFrame or numpy.ndarray or list) – The input feature(s)
Returns:	The response feature(s)
Return type:	pandas.DataFrame

predict_proba(X)[source]¶

Use the trained estimator to predict the response probabilities.

Parameters:	X (pandas.DataFrame or numpy.ndarray or list) – The input feature(s)
Returns:	The probability associated with each response feature
Return type:	pandas.DataFrame

score(X, y, sample_weight=None)[source]¶

Use the trained estimator to compute the evaluation score.

Note: sample weights are not supported

Parameters:

X (pandas.DataFrame or numpy.ndarray or list) – The input feature(s)
y (pandas.DataFrame or numpy.ndarray or list) – The response feature(s)
sample_weight – sample weighting. Not implemented.
Returns:	A computed prediction fitness score
Return type:	float

class MetaEstimator(estimator, preprocessor)[source]¶

Wrapper that allows data preprocessing on the response variable(s).

Parameters:

estimator – An instance of a subclass of `sklearn.base.BaseEstimator`
preprocessor – An instance of `foreshadow.preprocessor.Preprocessor`

dict_serialize(deep=False)[source]¶

Serialize the init parameters (dictionary form) of a transformer.

Parameters:	deep (bool) – If True, will return the parameters for this estimator recursively
Returns:	The initialization parameters of the transformer.
Return type:	dict

fit(X, y=None)[source]¶

Fit the MetaEstimator instance using the wrapped estimator.

Parameters:

X (`pandas.DataFrame` or `numpy.ndarray` or list) – The input feature(s)
y (`pandas.DataFrame` or `numpy.ndarray` or list) – The response feature(s)
Returns:	self

predict(X)[source]¶

Use the trained estimator to predict the response.

Parameters:	X (`pandas.DataFrame` or `numpy.ndarray` or list) – The input feature(s)
Returns:	The response feature(s) (transformed)
Return type:	`pandas.DataFrame`

predict_proba(X)[source]¶

Use the trained estimator to predict the response probabilities.

Parameters:	X (`pandas.DataFrame` or `numpy.ndarray` or list) – The input feature(s)
Returns:	The probability associated with each feature
Return type:	`pandas.DataFrame`

score(X, y)[source]¶

Use the trained estimator to compute the evaluation score.

Note: sample weights are not supported

Parameters:

X (`pandas.DataFrame` or `numpy.ndarray` or list) – The input feature(s)
y (`pandas.DataFrame` or `numpy.ndarray` or list) – The response feature(s)
Returns:	A computed prediction fitness score
Return type:	float
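
The idea behind MetaEstimator, an estimator paired with a preprocessor for the response, can be sketched with plain Python. This is an illustration under assumptions, not the library's code: `LogTarget`, `MeanRegressor`, and `TargetWrapper` are hypothetical, and the sketch assumes that predictions are mapped back to the original scale through an `inverse_transform` method.

```python
import math


class LogTarget:
    """Hypothetical response preprocessor: log transform with an inverse."""

    def fit(self, y):
        return self

    def transform(self, y):
        return [math.log(v) for v in y]

    def inverse_transform(self, y):
        return [math.exp(v) for v in y]


class MeanRegressor:
    """Hypothetical estimator that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class TargetWrapper:
    """Fit on the transformed response, predict on the original scale."""

    def __init__(self, estimator, preprocessor):
        self.estimator = estimator
        self.preprocessor = preprocessor

    def fit(self, X, y):
        y_transformed = self.preprocessor.fit(y).transform(y)
        self.estimator.fit(X, y_transformed)
        return self

    def predict(self, X):
        raw = self.estimator.predict(X)
        return self.preprocessor.inverse_transform(raw)


wrapper = TargetWrapper(MeanRegressor(), LogTarget())
wrapper.fit([[0], [1]], [1.0, 100.0])
predictions = wrapper.predict([[2]])  # geometric mean of 1 and 100
```

Because the mean is taken in log space, the prediction is the geometric mean of the training responses (about 10) rather than the arithmetic mean (50.5), which shows that the estimator only ever saw the transformed response.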

Optimizers¶

Foreshadow optimizers.

class ParamSpec(fs_pipeline=None, X_df=None, y_df=None)[source]¶

Holds the specification of the parameter search space.

A search space is a dict or list of dicts. This search space should be viewed as one run of optimization on the foreshadow object. The algorithm for optimization is determined by the optimizer that is chosen. Hence, this specification is agnostic of the optimizer chosen.

A dict represents the set of parameters to be applied in a single run.

A list represents a set of choices that the algorithm (again, agnostic at this point) can pick from.

For example, imagine s as our top level object, of structure:

s (object)
    .transformer (object)
        .attr

s has an attribute that may be optimized and in turn, that object has parameters that may be optimized. Below, we try two different transformers and try 2 different parameter specifications for each. Note that these parameters are specific to the type of transformer (StandardScaler does not have the parameter feature_range and vice versa).

[
    {
        "s__transformer": "StandardScaler",
        "s__transformer__with_mean": [False, True],
    },
    {
        "s__transformer": "MinMaxScaler",
        "s__transformer__feature_range": [(0, 1), (0, 0.5)],
    },
]

Here, the dicts tell the optimizer where the values to set are. The lists showcase the different values that are possible.
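
To make the dict and list roles concrete, the sketch below enumerates every concrete setting implied by the search space above, the way a grid-style optimizer might. It is a standalone illustration, not ParamSpec's or any optimizer's actual code; `expand` is a hypothetical helper.

```python
from itertools import product


def expand(search_space):
    """Yield one concrete parameter dict per combination of list choices."""
    if isinstance(search_space, dict):
        search_space = [search_space]
    for spec in search_space:
        keys = list(spec)
        # A list value is a set of choices; anything else is a fixed setting.
        choices = [v if isinstance(v, list) else [v] for v in spec.values()]
        for combo in product(*choices):
            yield dict(zip(keys, combo))


space = [
    {
        "s__transformer": "StandardScaler",
        "s__transformer__with_mean": [False, True],
    },
    {
        "s__transformer": "MinMaxScaler",
        "s__transformer__feature_range": [(0, 1), (0, 0.5)],
    },
]

candidates = list(expand(space))
for candidate in candidates:
    print(candidate)
```

The two dicts each contribute two candidates, giving four runs in total. Tuples such as `(0, 1)` are treated as single values, so only lists introduce a choice.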

convert(key, replace_val=<function hp_choice>)[source]¶

Convert internal self.param_distributions to valid distribution.

Uses _replace_list to replace all lists with replace_val

Parameters:

key – key to use for top level hp.choice name
replace_val – value to replace lists with.

get_params(deep=True)[source]¶

Get the params for this object. Used for serialization.

Parameters:	deep – Does nothing. Here for sklearn compatibility.
Returns:	Members that need to be set for this object.

set_params(**params)[source]¶

Set the params for this object. Used for serialization.

Also used to init this object when automatic tuning is not used.

Parameters:	**params – Members to set from get_params.
Returns:	self.

class Tuner(pipeline=None, params=None, optimizer=None, optimizer_kwargs={})[source]¶

Tunes the Foreshadow object using a ParamSpec and Optimizer.

fit(X, y, **fit_params)[source]¶

Optimize self.pipeline using self.optimizer.

Parameters:

X – input points
y – input labels
**fit_params – params to optimizer fit method.
Returns:	self

transform(pipeline)[source]¶

Transform pipeline using best_pipeline.

Parameters:	pipeline – input pipeline
Returns:	best_pipeline.

class RandomSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise', return_train_score='warn', max_tries=100)[source]¶

Optimize Foreshadow.pipeline and/or its sub-objects.

get(optimizer, **optimizer_kwargs)[source]¶

Get optimizer from foreshadow.optimizers package.

Parameters:

optimizer – optimizer name or class
**optimizer_kwargs – kwargs used in instantiation.
Returns:	Corresponding instantiated optimizer using kwargs.

Utils¶

Common Foreshadow utilities.

get_cache_path()[source]¶

Get the cache path which is in the config directory.

Note

This function also makes the directory if it does not already exist.

Returns:	The path to the cache directory.
Return type:	str

get_config_path()[source]¶

Get the default config path.

Note

This function also makes the directory if it does not already exist.

Returns:	The path to the config directory.
Return type:	str
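
Both path helpers create the directory on first use. The sketch below shows that "return the path, creating it if needed" pattern with the standard library. The directory names and the temporary base directory are assumptions for illustration; the real locations are decided by foreshadow.

```python
import os
import tempfile


def ensure_dir(path):
    """Return path, creating the directory first if it is missing."""
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical layout: a cache directory nested inside a config directory.
base = tempfile.mkdtemp()
config_path = ensure_dir(os.path.join(base, "config"))
cache_path = ensure_dir(os.path.join(config_path, "cache"))
```

Calling `ensure_dir` a second time on the same path is harmless because `exist_ok=True` suppresses the error for an existing directory.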

get_transformer(class_name, source_lib=None)[source]¶

Get the transformer class from its name.

Note

In case of a name conflict, the internal transformer is preferred over the external transformer import. This should only be used in internal unit tests; get_transformer from serialization should be preferred in all other cases. This was written to decouple registration from unit testing.

Parameters:

class_name (str) – The transformer class name
source_lib (str) – The string import path if known
Returns:	Imported class
Raises:	`TransformerNotFound` – If class_name could not be found in internal or external transformer library pathways.
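
The lookup order described in the note above (internal first, then external, then an error) can be sketched as follows. The registries and classes are hypothetical stand-ins, and the local `TransformerNotFound` class only mirrors the name of the exception in this reference; none of this is foreshadow's own code.

```python
class TransformerNotFound(Exception):
    """Stand-in for the exception named in the reference above."""


# Hypothetical transformer classes and registries, for illustration only.
class InternalScaler:
    pass


class ExternalScaler:
    pass


class ExternalEncoder:
    pass


INTERNAL = {"Scaler": InternalScaler}
EXTERNAL = {"Scaler": ExternalScaler, "Encoder": ExternalEncoder}


def lookup_transformer(class_name):
    """Resolve a class by name, preferring internal over external."""
    for registry in (INTERNAL, EXTERNAL):
        if class_name in registry:
            return registry[class_name]
    raise TransformerNotFound(f"Could not find transformer {class_name!r}")


scaler_cls = lookup_transformer("Scaler")    # name conflict: internal wins
encoder_cls = lookup_transformer("Encoder")  # only defined externally
```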

check_df(input_data, ignore_none=False, single_column=False, single_or_empty=False)[source]¶

Convert non dataframe inputs into dataframes.

Parameters:

input_data (`pandas.DataFrame`, `numpy.ndarray`, list) – input to convert
ignore_none (bool) – allow None to pass through check_df
single_column (bool) – check if frame is of a single column and return series
single_or_empty (bool) – check if the frame is a single column or an empty DF.
Returns:	Converted and validated input dataframes
Return type:	`DataFrame`
Raises:

`ValueError` – Invalid input type
`ValueError` – Input dataframe must only have one column

check_series(input_data)[source]¶

Convert non series inputs into series.

This function is to be used in situations where a series is expected but cannot be guaranteed to exist. For example, this function is used in the metrics package to perform computations on a column using functions that only work with series.

Note

This is not to be used in transformers as it will break the standard that enforces only DataFrames as input and output for those objects.

Parameters:	input_data (iterable) – The input data
Returns:	pandas.Series
Raises:

`ValueError` – If the data could not be processed
`ValueError` – If the input is a DataFrame and has more than one column

check_module_installed(name)[source]¶

Check whether a module is available for import.

Parameters:	name (str) – module name
Returns:	Whether the module can be imported
Return type:	bool
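
A check of this kind can be written with `importlib.util.find_spec`, which locates a module without importing it. The sketch below is one way to do it and is not necessarily how foreshadow implements the function; `module_available` is a hypothetical name, and the sketch is meant for top-level module names.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located for import."""
    return importlib.util.find_spec(name) is not None


has_json = module_available("json")
has_missing = module_available("no_such_module_for_this_sketch")
```

This is useful for optional dependencies: code can test for a package first and only then import it.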

check_transformer_imports(printout=True)[source]¶

Determine which transformers were automatically imported.

Parameters:	printout (bool, optional) – Whether to output to stdout
Returns:	A tuple of the internal transformers and the external transformers
Return type:	tuple(list)

is_transformer(value, method='isinstance')[source]¶

Check if the class is a transformer class.

Parameters:

value – Class or instance
method (str) – Method of checking. Options are ‘issubclass’ or ‘isinstance’
Returns:	True if transformer, False if not.
Raises:	`ValueError` – if method is neither issubclass nor isinstance
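
The two checking modes differ in what they accept: ‘isinstance’ is meant for objects, ‘issubclass’ for classes. A minimal sketch of that dispatch follows, with a hypothetical `BaseTransformer` standing in for whatever base classes foreshadow actually checks against.

```python
class BaseTransformer:
    """Hypothetical base class standing in for a real transformer base."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class Doubler(BaseTransformer):
    def transform(self, X):
        return [2 * x for x in X]


def is_transformer_sketch(value, method="isinstance"):
    """Check an instance or a class, depending on the chosen method."""
    if method == "isinstance":
        return isinstance(value, BaseTransformer)
    if method == "issubclass":
        # issubclass raises TypeError for non-classes, so guard first.
        return isinstance(value, type) and issubclass(value, BaseTransformer)
    raise ValueError("method must be 'isinstance' or 'issubclass'")


instance_ok = is_transformer_sketch(Doubler())
class_ok = is_transformer_sketch(Doubler, method="issubclass")
class_as_instance = is_transformer_sketch(Doubler, method="isinstance")
```

Passing a class with the default method returns False, which is why the method argument matters when checking classes rather than objects.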

is_wrapped(transformer)[source]¶

Check if a transformer is wrapped.

Parameters:	transformer – A transformer instance
Returns:	True if transformer is wrapped, otherwise False.
Return type:	bool

dynamic_import(attribute, module_path)[source]¶

Import attribute from module found at module_path at runtime.

Parameters:

attribute – the attribute of the module to import (class, function, …)
module_path – the path to the module.
Returns:	attribute from module_path.
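
Importing an attribute by name at runtime is commonly done with `importlib.import_module` followed by `getattr`. The sketch below uses the same argument order as the signature above but is an illustration rather than the library's implementation; `import_attribute` is a hypothetical name.

```python
import importlib


def import_attribute(attribute, module_path):
    """Import module_path at runtime and return the named attribute."""
    module = importlib.import_module(module_path)
    return getattr(module, attribute)


sqrt = import_attribute("sqrt", "math")
root = sqrt(9)
```

A missing module raises `ModuleNotFoundError` and a missing attribute raises `AttributeError`, so callers can tell the two failures apart.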

mode_freq(s, count=10)[source]¶

get_outliers(s, count=10)[source]¶

standard_col_summary(df)[source]¶

class ConfigureColumnSharerMixin[source]¶

Mixin that configures the column sharer.

configure_column_sharer(column_sharer)[source]¶

Configure the column sharer attribute if it exists.

Parameters:	column_sharer – a column sharer instance
