API reference¶

The public API of evo_gafs.

`GAFeatureSelector`	Genetic-algorithm wrapper feature selector, compatible with scikit-learn.
`GAConfig`	Full configuration of the genetic algorithm.
`SelectionResult`	Final outcome of a GA feature-selection run.
`EvolutionStats`	Statistics for a single generation during evolution.
`FitnessEvaluator`	Evaluate an individual (binary feature mask) with cross-validation.
`BenchmarkRunner`	Run and compare GA feature selection across several datasets.
`GAPlotter`	Static helpers to visualise evolution curves and Pareto fronts.

Selector¶

class evo_gafs.GAFeatureSelector(estimator, config=None, scoring=None, task_type='auto', feature_names=None)[source]¶

Bases: SelectorMixin, BaseEstimator

Genetic-algorithm wrapper feature selector, compatible with scikit-learn.

The selector searches for the subset of features that maximises a cross-validated score of estimator (the wrapper criterion), optionally trading raw performance for a smaller feature set.

Parameters:

estimator (sklearn estimator) – Model used to score candidate feature subsets. Must implement fit and predict. Fast estimators (decision trees, linear models) keep the search affordable. It is cloned for every evaluation, never fitted in place.
config (GAConfig, optional) – Genetic-algorithm configuration. If None, defaults are used.
scoring (str, optional) – scikit-learn scoring string (e.g. 'accuracy', 'f1_macro', 'r2', 'neg_mean_squared_error'). If None it is chosen from task_type.
task_type ({'auto', 'classification', 'regression'}, default='auto') – Problem type. 'auto' infers it from y.
feature_names (list of str, optional) – Names of the input features. Inferred from a DataFrame’s columns or generated as f0, f1, ... when not given.

result_¶

Full result of the run (set after fit()).

Type: SelectionResult

support_¶

Boolean mask of selected features.

Type: numpy.ndarray of bool

n_features_in_¶

Number of features seen during fit().

Type: int

feature_names_in_¶

Names of features seen during fit() (only when X is a DataFrame).

Type: numpy.ndarray

Examples

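>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import GAFeatureSelector, GAConfig
>>> # Iris is used here only as a small illustrative dataset.
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> config = GAConfig(population_size=20, n_generations=10, verbose=False)
>>> selector = GAFeatureSelector(
...     estimator=DecisionTreeClassifier(random_state=42),
...     config=config,
... )
>>> selector = selector.fit(X_train, y_train)  # fit() returns the fitted selector
>>> X_selected = selector.transform(X_test)  # keep only the selected columns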

In addition to the methods documented below, GAFeatureSelector inherits the standard scikit-learn transformer API from sklearn.feature_selection.SelectorMixin and sklearn.base.BaseEstimator, notably transform, fit_transform, get_support, get_params and set_params.

fit(X, y, callbacks=None)[source]¶

Run the genetic algorithm to find the best feature subset.

Parameters:

X (array-like or pandas.DataFrame of shape (n_samples, n_features)) – Training data. Sparse matrices are not supported.
y (array-like of shape (n_samples,)) – Target values. Integers/strings for classification, floats for regression.
callbacks (list of callable, optional) – Functions f(gen, stats, population) -> bool. Returning True stops evolution early.

Returns:

self – The fitted selector.

Return type:

GAFeatureSelector
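
As a rough illustration of the callbacks contract, the sketch below builds a patience-based stopping callback. It relies only on the documented f(gen, stats, population) -> bool signature and on the best_fitness field of EvolutionStats. The factory name make_patience_callback and the SimpleNamespace stand-ins are hypothetical; they exist only so the snippet runs without evo_gafs installed.

```python
from types import SimpleNamespace


def make_patience_callback(patience=3, tol=1e-4):
    """Build a callback that returns True (stop) after `patience` stale generations.

    Hypothetical helper: only the f(gen, stats, population) -> bool contract
    and the stats.best_fitness attribute come from the documentation.
    """
    state = {"best": float("-inf"), "stale": 0}

    def callback(gen, stats, population):
        if stats.best_fitness > state["best"] + tol:
            # Significant improvement: remember it and reset the counter.
            state["best"] = stats.best_fitness
            state["stale"] = 0
        else:
            state["stale"] += 1
        return state["stale"] >= patience

    return callback


# Simulated run: stand-in objects mimic EvolutionStats.best_fitness.
callback = make_patience_callback(patience=2)
fitness_curve = [0.70, 0.80, 0.80, 0.80, 0.90]
stopped_at = None
for gen, best in enumerate(fitness_curve):
    stats = SimpleNamespace(best_fitness=best)
    if callback(gen, stats, population=[]):
        stopped_at = gen
        break
```

With the real selector, such a callback would presumably be passed as selector.fit(X, y, callbacks=[callback]).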

summary()[source]¶

Return a human-readable summary of the fitted result.

Return type: str

set_fit_request(*, callbacks='$UNCHANGED$')¶

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.

Returns:

self – The updated object.

Return type:

object

Configuration & results¶

class evo_gafs.GAConfig(population_size=50, n_generations=100, crossover_prob=0.8, mutation_prob=0.15, mutation_indpb=None, tournament_size=3, mode='single', alpha=0.8, cv_folds=5, min_features=1, elite_size=2, random_seed=42, n_jobs=1, verbose=True, early_stopping_rounds=None, early_stopping_tol=0.0001)[source]¶

Bases: object

Full configuration of the genetic algorithm.

Parameters:

population_size (int, default=50) – Number of individuals in the population. Larger populations explore the search space better at a higher computational cost. Typical: 30-100.
n_generations (int, default=100) – Number of generations (iterations of the GA). Typical: 50-200.
crossover_prob (float, default=0.8) – Probability of applying crossover between two individuals. Recommended range: [0.6, 0.9].
mutation_prob (float, default=0.15) – Probability of applying the mutation operator to an individual. Recommended range: [0.05, 0.3].
mutation_indpb (float or None, default=None) – Independent probability of flipping each bit when an individual is mutated. If None, it is set to 1 / n_features at fit time.
tournament_size (int, default=3) – Tournament size for tournament selection (mode='single'). Larger values increase selective pressure. Typical: 2-7.
mode ({'single', 'multiobjective'}, default='single') – 'single' uses a weighted scalar fitness; 'multiobjective' uses NSGA-II and returns a Pareto front.
alpha (float, default=0.8) –
Weight of the performance metric in mode='single':
```
fitness = alpha * cv_score + (1 - alpha) * compression_ratio
```
alpha=1.0 is a pure wrapper; lower values favour compression (useful for edge deployment).
cv_folds (int, default=5) – Number of cross-validation folds used to evaluate fitness.
min_features (int, default=1) – Minimum number of selected features. Individuals below this threshold are repaired/penalised.
elite_size (int, default=2) – Number of best individuals carried over unchanged each generation (elitism, mode='single' only).
random_seed (int or None, default=42) – Seed for reproducibility.
n_jobs (int, default=1) – Parallelism passed to scikit-learn’s cross-validation.
verbose (bool, default=True) – If True, print a log line for some generations and a final summary.
early_stopping_rounds (int or None, default=None) – Stop the run if the best fitness does not improve for this many consecutive generations. None disables early stopping. Applies to mode='single' only.
early_stopping_tol (float, default=1e-4) – Minimum improvement considered significant for early stopping.
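
To make the roles of mutation_prob and mutation_indpb concrete, here is a minimal sketch of a bit-flip mutation under the documented defaults. It is not the library's operator, which this reference does not show; the function name mutate and the use of random.Random are assumptions made for illustration.

```python
import random


def mutate(mask, rng, mutation_prob=0.15, mutation_indpb=None):
    """Illustrative bit-flip mutation on a binary feature mask.

    With probability `mutation_prob` the individual is mutated; each bit is
    then flipped independently with probability `mutation_indpb`, which
    falls back to 1 / n_features when None (as documented for GAConfig).
    """
    n_features = len(mask)
    indpb = 1.0 / n_features if mutation_indpb is None else mutation_indpb
    if rng.random() >= mutation_prob:
        return list(mask)  # individual left unchanged
    return [bit ^ 1 if rng.random() < indpb else bit for bit in mask]


rng = random.Random(42)
parent = [1, 0, 1, 1, 0, 0, 1, 0]
child = mutate(parent, rng, mutation_prob=1.0)  # force mutation for the demo

# With the 1 / n_features default, one bit flips on average per mutation.
expected_flips = len(parent) * (1.0 / len(parent))
```

The 1 / n_features default keeps the expected number of flipped bits at one regardless of how many features the dataset has.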

validate()[source]¶

Validate the configuration, raising ValueError on error.

Unlike assert statements, these checks are always enforced, even when Python runs with optimisations (-O).

Return type: None
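
The difference between explicit checks and assert statements can be sketched as follows. MiniConfig is a hypothetical stand-in that borrows three documented field names; the specific bounds it checks are assumptions, since this reference does not list what GAConfig.validate() actually tests.

```python
from dataclasses import dataclass


@dataclass
class MiniConfig:
    """Hypothetical stand-in for GAConfig with three of its documented fields."""

    population_size: int = 50
    crossover_prob: float = 0.8
    alpha: float = 0.8

    def validate(self) -> None:
        # Explicit raises survive `python -O`, whereas assert statements are stripped.
        if self.population_size < 2:
            raise ValueError("population_size must be at least 2")
        if not 0.0 <= self.crossover_prob <= 1.0:
            raise ValueError("crossover_prob must be in [0, 1]")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be in [0, 1]")


MiniConfig().validate()  # defaults pass silently

try:
    MiniConfig(alpha=1.5).validate()
except ValueError as exc:
    error_message = str(exc)
```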

to_dict()[source]¶

Return the configuration as a plain dictionary.

Return type: dict

class evo_gafs.SelectionResult(selected_mask, selected_indices, selected_feature_names, best_fitness, best_cv_score, n_selected, compression_ratio, history=<factory>, pareto_front=None, config=None, total_time=0.0, n_evaluations=0)[source]¶

Bases: object

Final outcome of a GA feature-selection run.

Parameters:

selected_mask (ndarray)
selected_indices (ndarray)
selected_feature_names (list[str])
best_fitness (float)
best_cv_score (float)
n_selected (int)
compression_ratio (float)
history (list[EvolutionStats])
pareto_front (list[dict] | None)
config (GAConfig | None)
total_time (float)
n_evaluations (int)

selected_mask¶

Boolean vector of length n_features. True marks a selected feature.

Type: numpy.ndarray

selected_indices¶

Indices of the selected features.

Type: numpy.ndarray

selected_feature_names¶

Names of the selected features.

Type: list of str

best_fitness¶

Best fitness achieved.

Type: float

best_cv_score¶

Best cross-validation score (raw metric, unweighted).

Type: float

n_selected¶

Number of selected features.

Type: int

compression_ratio¶

Fraction of features removed (1 - n_selected / n_total).

Type: float
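
The relationship between selected_mask, selected_indices, n_selected and compression_ratio can be worked through on a small made-up mask. Plain Python lists are used here instead of NumPy arrays purely to keep the sketch dependency-free.

```python
# Made-up mask over 8 features; True marks a selected feature.
selected_mask = [True, False, True, False, False, True, False, False]

# Positions of the True entries.
selected_indices = [i for i, keep in enumerate(selected_mask) if keep]

n_total = len(selected_mask)
n_selected = len(selected_indices)

# Fraction of features removed, as defined for compression_ratio.
compression_ratio = 1 - n_selected / n_total
```

Here three of eight features are kept, so five eighths of the features are removed.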

history¶

Per-generation statistics.

Type: list of EvolutionStats

pareto_front¶

Pareto front of the run. Set only in mode='multiobjective'; None otherwise. Each entry holds mask, cv_score, compression and n_features.

Type: list of dict or None
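
For readers unfamiliar with the term, the sketch below shows what "Pareto front" means for entries with the documented cv_score and compression keys, both maximised. It is a naive quadratic filter written for illustration, not NSGA-II and not the library's code; the candidate values are made up and the mask key is omitted for brevity.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on both objectives and better on one."""
    return (
        a["cv_score"] >= b["cv_score"]
        and a["compression"] >= b["compression"]
        and (a["cv_score"] > b["cv_score"] or a["compression"] > b["compression"])
    )


def non_dominated(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up entries for a 10-feature problem.
candidates = [
    {"cv_score": 0.95, "compression": 0.20, "n_features": 8},
    {"cv_score": 0.93, "compression": 0.60, "n_features": 4},
    {"cv_score": 0.90, "compression": 0.50, "n_features": 5},
    {"cv_score": 0.85, "compression": 0.80, "n_features": 2},
]
front = non_dominated(candidates)
```

The 5-feature candidate drops out because the 4-feature one is better on both objectives; the remaining three each represent a different trade-off.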

config¶

Configuration used for the run.

Type: GAConfig or None

total_time¶

Total wall-clock time in seconds.

Type: float

n_evaluations¶

Total number of fitness evaluations performed.

Type: int

summary()[source]¶

Return a human-readable multi-line summary of the result.

Return type: str

to_json()[source]¶

Return a JSON-serialisable dictionary (e.g. for logging/MLflow).

Return type: dict[str, Any]

save_json(path)[source]¶

Write the JSON-serialisable summary to path.

Parameters: path (str | Path)
Return type: None

save(path)[source]¶

Pickle the full result object to path.

Parameters: path (str | Path)
Return type: None

classmethod load(path)[source]¶

Load a pickled SelectionResult from path.

Parameters: path (str | Path)
Return type: SelectionResult

class evo_gafs.EvolutionStats(generation, best_fitness, mean_fitness, std_fitness, best_n_features, mean_n_features, elapsed_time)[source]¶

Bases: object

Statistics for a single generation during evolution.

Parameters:

generation (int)
best_fitness (float)
mean_fitness (float)
std_fitness (float)
best_n_features (int)
mean_n_features (float)
elapsed_time (float)

Evaluation¶

class evo_gafs.FitnessEvaluator(estimator, X, y, scoring, cv, config)[source]¶

Bases: object

Evaluate an individual (binary feature mask) with cross-validation.

The evaluator is instantiated once per fit and registered with DEAP as the evaluate operator. It caches results per individual to avoid re-running cross-validation for genomes that have already been seen.
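
The caching idea can be sketched as a memoising wrapper keyed on the genome. CachedEvaluator and its toy scoring function are hypothetical; how FitnessEvaluator stores its cache internally is not described in this reference.

```python
class CachedEvaluator:
    """Hypothetical memoising wrapper around an expensive scoring function."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.eval_count = 0  # counts only non-cached evaluations

    def __call__(self, individual):
        key = tuple(individual)  # lists are unhashable; tuples can be dict keys
        if key not in self._cache:
            self._cache[key] = self._score_fn(individual)
            self.eval_count += 1
        return self._cache[key]


# Toy scoring function standing in for cross-validation.
evaluate = CachedEvaluator(lambda mask: sum(mask) / len(mask))
first = evaluate([1, 0, 1, 1])
second = evaluate([1, 0, 1, 1])  # served from the cache
third = evaluate([0, 0, 1, 1])
```

Because elitism and crossover tend to reproduce genomes that already exist in the population, a cache of this kind can avoid a noticeable share of cross-validation runs.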

Penalisation strategy¶

An individual with fewer than min_features active features is assigned a fitness of zero.

Fitness by mode¶

'single':

fitness = alpha * cv_score + (1 - alpha) * compression

where compression = 1 - n_selected / n_total.

'multiobjective': returns (cv_score, compression), both maximised (DEAP/NSGA-II handles the Pareto front).
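
The penalisation rule and the two fitness modes can be combined into one small function. This is a sketch of the documented formulas, not the evaluator's code: the function name fitness is hypothetical, cv_score is passed in rather than computed, and returning (0.0, 0.0) for a penalised individual in multi-objective mode is an assumption.

```python
def fitness(mask, cv_score, mode="single", alpha=0.8, min_features=1):
    """Illustrative fitness for a binary feature mask and a given CV score."""
    n_total = len(mask)
    n_selected = sum(mask)
    if n_selected < min_features:
        # Documented penalty: fitness of zero. The (0.0, 0.0) tuple for the
        # multi-objective case is an assumption of this sketch.
        return 0.0 if mode == "single" else (0.0, 0.0)
    compression = 1 - n_selected / n_total
    if mode == "single":
        return alpha * cv_score + (1 - alpha) * compression
    return (cv_score, compression)


mask = [1, 0, 0, 1]  # 2 of 4 features kept, so compression is 0.5

single = fitness(mask, cv_score=0.9)              # about 0.82 with alpha=0.8
pure = fitness(mask, cv_score=0.9, alpha=1.0)     # compression term vanishes
multi = fitness(mask, cv_score=0.9, mode="multiobjective")
penalised = fitness([0, 0, 0, 0], cv_score=0.9)   # below min_features
```

With alpha=0.8, a subset that halves the feature count gains 0.1 of fitness from compression, so it can outrank a full-feature subset whose CV score is up to 0.125 higher.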

Parameters:

estimator (sklearn estimator) – Model used as the wrapper criterion. It is cloned for each evaluation.
X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix.
y (numpy.ndarray of shape (n_samples,)) – Target vector.
scoring (str) – scikit-learn scoring string.
cv (cross-validation splitter) – Splitter used for the score.
config (GAConfig) – Configuration (provides mode, alpha, min_features, n_jobs).

cv_score(selected)[source]¶

Public helper: raw (unweighted) CV score for a feature subset.

Parameters: selected (list[int])
Return type: float

property eval_count: int¶

Number of (non-cached) fitness evaluations performed.

Benchmarking & visualization¶

class evo_gafs.BenchmarkRunner[source]¶

Bases: object

Run and compare GA feature selection across several datasets.

For each registered dataset the runner records:

the model’s cross-validated score using all features (baseline),
the score using the features selected by the GA,
the compression ratio, and
the wall-clock time.

Examples

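>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import BenchmarkRunner
>>> X, y = load_iris(return_X_y=True)
>>> runner = BenchmarkRunner()
>>> _ = runner.add_dataset("Iris", X, y, task_type="classification")
>>> results = runner.run(DecisionTreeClassifier())
>>> summary = runner.report()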

add_dataset(name, X, y, task_type='auto', description='')[source]¶

Register a dataset to include in the benchmark.

Parameters:

name (str)
X (ndarray | DataFrame)
y (ndarray | DataFrame)
task_type (str)
description (str)

Return type:

BenchmarkRunner

run(estimator, config=None, scoring=None, verbose=True, estimator_regression=None)[source]¶

Run the benchmark over all registered datasets.

Parameters:

estimator (sklearn estimator) – Model for classification datasets (and for all datasets when estimator_regression is not given).
config (GAConfig, optional) – Configuration applied to every run.
scoring (str, optional) – Scoring string; auto-selected per task when None.
verbose (bool, default=True) – Print a per-dataset progress report.
estimator_regression (sklearn estimator, optional) – Alternative model for regression datasets.

Returns:

One result entry per dataset.

Return type:

list of dict

report()[source]¶

Return (and print) a summary pandas.DataFrame of the runs.

Return type: DataFrame

property results: list[dict]¶

The list of result entries from the last run().

class evo_gafs.GAPlotter[source]¶

Bases: object

Static helpers to visualise evolution curves and Pareto fronts.

static plot_evolution(result, figsize=(14, 5), title_prefix='')[source]¶

Plot fitness and feature-count evolution over generations.

Parameters:

result (SelectionResult)
figsize (tuple[float, float])
title_prefix (str)

static plot_pareto_front(result, figsize=(8, 6), title='Pareto front')[source]¶

Plot the Pareto front (multi-objective mode only).

Parameters:

result (SelectionResult)
figsize (tuple[float, float])
title (str)

static plot_selected_features(result, feature_names=None, figsize=(10, 6), title='Selected vs removed features')[source]¶

Show which features were selected versus removed.

Parameters:

result (SelectionResult)
feature_names (list[str] | None)
figsize (tuple[float, float])
title (str)