API reference

The public API of evo_gafs.

GAFeatureSelector

Genetic-algorithm wrapper feature selector, compatible with scikit-learn.

GAConfig

Full configuration of the genetic algorithm.

SelectionResult

Final outcome of a GA feature-selection run.

EvolutionStats

Statistics for a single generation during evolution.

FitnessEvaluator

Evaluate an individual (binary feature mask) with cross-validation.

BenchmarkRunner

Run and compare GA feature selection across several datasets.

GAPlotter

Static helpers to visualise evolution curves and Pareto fronts.

Selector

class evo_gafs.GAFeatureSelector(estimator, config=None, scoring=None, task_type='auto', feature_names=None)[source]

Bases: SelectorMixin, BaseEstimator

Genetic-algorithm wrapper feature selector, compatible with scikit-learn.

The selector searches for the subset of features that maximises a cross-validated score of estimator (the wrapper criterion), optionally trading raw performance for a smaller feature set.

Parameters:
  • estimator (sklearn estimator) – Model used to score candidate feature subsets. Must implement fit and predict. Fast estimators (decision trees, linear models) keep the search affordable. It is cloned for every evaluation, never fitted in place.

  • config (GAConfig, optional) – Genetic-algorithm configuration. If None, defaults are used.

  • scoring (str, optional) – scikit-learn scoring string (e.g. 'accuracy', 'f1_macro', 'r2', 'neg_mean_squared_error'). If None it is chosen from task_type.

  • task_type ({'auto', 'classification', 'regression'}, default='auto') – Problem type. 'auto' infers it from y.

  • feature_names (list of str, optional) – Names of the input features. Inferred from a DataFrame’s columns or generated as f0, f1, ... when not given.

result_

Full result of the run (set after fit()).

Type:

SelectionResult

support_

Boolean mask of selected features.

Type:

numpy.ndarray of bool

n_features_in_

Number of features seen during fit().

Type:

int

feature_names_in_

Names of features seen during fit() (only when X is a DataFrame).

Type:

numpy.ndarray

Examples

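>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import GAFeatureSelector, GAConfig
>>> # Any tabular dataset works; this one is used only for illustration.
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> config = GAConfig(population_size=20, n_generations=10, verbose=False)
>>> selector = GAFeatureSelector(
...     estimator=DecisionTreeClassifier(random_state=42),
...     config=config,
... )
>>> selector.fit(X_train, y_train)
>>> X_selected = selector.transform(X_test)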

In addition to the methods below, GAFeatureSelector inherits the standard scikit-learn transformer API from sklearn.feature_selection.SelectorMixin and sklearn.base.BaseEstimator, notably transform, fit_transform, get_support, get_params and set_params.

fit(X, y, callbacks=None)[source]

Run the genetic algorithm to find the best feature subset.

Parameters:
  • X (array-like or pandas.DataFrame of shape (n_samples, n_features)) – Training data. Sparse matrices are not supported.

  • y (array-like of shape (n_samples,)) – Target values. Integers/strings for classification, floats for regression.

  • callbacks (list of callable, optional) – Functions f(gen, stats, population) -> bool. Returning True stops evolution early.

Returns:

self – The fitted selector.

Return type:

GAFeatureSelector
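
A callback matching the f(gen, stats, population) -> bool contract might look like the sketch below. It assumes that stats exposes a best_fitness attribute, as EvolutionStats does. The factory name stop_at_fitness and the stand-in stats objects are hypothetical and exist only so the function can be exercised without running the GA.

```python
from types import SimpleNamespace


def stop_at_fitness(target):
    """Build a callback that stops evolution once best fitness reaches `target`."""

    def callback(gen, stats, population):
        # Returning True asks the selector to stop early.
        return stats.best_fitness >= target

    return callback


# Stand-ins for the per-generation statistics (made-up values).
early = SimpleNamespace(best_fitness=0.71)
late = SimpleNamespace(best_fitness=0.93)

cb = stop_at_fitness(0.90)
print(cb(0, early, population=[]))  # False: keep evolving
print(cb(7, late, population=[]))   # True: stop
```

Such a callback would then be passed as selector.fit(X_train, y_train, callbacks=[cb]).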

summary()[source]

Return a human-readable summary of the fitted result.

Return type:

str

set_fit_request(*, callbacks='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in scikit-learn version 1.3.

Parameters:
  • callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.

Returns:

self – The updated object.

Return type:

object

Configuration & results

class evo_gafs.GAConfig(population_size=50, n_generations=100, crossover_prob=0.8, mutation_prob=0.15, mutation_indpb=None, tournament_size=3, mode='single', alpha=0.8, cv_folds=5, min_features=1, elite_size=2, random_seed=42, n_jobs=1, verbose=True, early_stopping_rounds=None, early_stopping_tol=0.0001)[source]

Bases: object

Full configuration of the genetic algorithm.

Parameters:
  • population_size (int, default=50) – Number of individuals in the population. Larger populations explore the search space better at a higher computational cost. Typical: 30-100.

  • n_generations (int, default=100) – Number of generations (iterations of the GA). Typical: 50-200.

  • crossover_prob (float, default=0.8) – Probability of applying crossover between two individuals. Recommended range: [0.6, 0.9].

  • mutation_prob (float, default=0.15) – Probability of applying the mutation operator to an individual. Recommended range: [0.05, 0.3].

  • mutation_indpb (float or None, default=None) – Independent probability of flipping each bit when an individual is mutated. If None, it is set to 1 / n_features at fit time.

  • tournament_size (int, default=3) – Tournament size for tournament selection (mode='single'). Larger values increase selective pressure. Typical: 2-7.

  • mode ({'single', 'multiobjective'}, default='single') – 'single' uses a weighted scalar fitness; 'multiobjective' uses NSGA-II and returns a Pareto front.

  • alpha (float, default=0.8) –

    Weight of the performance metric in mode='single':

    fitness = alpha * cv_score + (1 - alpha) * compression_ratio
    

    alpha=1.0 is a pure wrapper; lower values favour compression (useful for edge deployment).

  • cv_folds (int, default=5) – Number of cross-validation folds used to evaluate fitness.

  • min_features (int, default=1) – Minimum number of selected features. Individuals below this threshold are repaired/penalised.

  • elite_size (int, default=2) – Number of best individuals carried over unchanged each generation (elitism, mode='single' only).

  • random_seed (int or None, default=42) – Seed for reproducibility.

  • n_jobs (int, default=1) – Parallelism passed to scikit-learn’s cross-validation.

  • verbose (bool, default=True) – If True, print a log line for some generations and a final summary.

  • early_stopping_rounds (int or None, default=None) – If the best fitness does not improve for this many generations, stop. None disables early stopping (mode='single' only).

  • early_stopping_tol (float, default=1e-4) – Minimum improvement considered significant for early stopping.
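
The effect of alpha can be seen by scoring two candidate subsets with the mode='single' formula given above. This is a minimal sketch: weighted_fitness is an illustrative helper, not part of evo_gafs, and the scores and feature counts are made up.

```python
def weighted_fitness(cv_score, n_selected, n_total, alpha=0.8):
    """fitness = alpha * cv_score + (1 - alpha) * compression_ratio."""
    compression_ratio = 1 - n_selected / n_total
    return alpha * cv_score + (1 - alpha) * compression_ratio


# Two made-up candidates on a 20-feature problem.
big = dict(cv_score=0.95, n_selected=15, n_total=20)   # accurate, barely compressed
small = dict(cv_score=0.93, n_selected=5, n_total=20)  # slightly worse, much smaller

for alpha in (1.0, 0.8):
    f_big = weighted_fitness(alpha=alpha, **big)
    f_small = weighted_fitness(alpha=alpha, **small)
    winner = "big" if f_big > f_small else "small"
    print(f"alpha={alpha}: big={f_big:.3f} small={f_small:.3f} -> {winner}")
```

With these numbers, alpha=1.0 prefers the larger subset (0.950 against 0.930), while alpha=0.8 prefers the smaller one (0.894 against 0.810).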

validate()[source]

Validate the configuration, raising ValueError on error.

Unlike assert statements, these checks are always enforced, even when Python runs with optimisations (-O).

Return type:

None
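
The difference from assert can be illustrated with a stand-in class. MiniConfig and its two checks are hypothetical; the checks that GAConfig.validate() actually performs are not listed in this reference.

```python
class MiniConfig:
    """Hypothetical stand-in holding two GAConfig-like fields."""

    def __init__(self, population_size=50, crossover_prob=0.8):
        self.population_size = population_size
        self.crossover_prob = crossover_prob

    def validate(self):
        # Explicit raises survive `python -O`, which strips assert statements.
        if self.population_size < 2:
            raise ValueError("population_size must be at least 2")
        if not 0.0 <= self.crossover_prob <= 1.0:
            raise ValueError("crossover_prob must be in [0, 1]")


MiniConfig().validate()  # defaults pass silently

try:
    MiniConfig(crossover_prob=1.5).validate()
except ValueError as exc:
    message = str(exc)
    print(f"rejected: {message}")
```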

to_dict()[source]

Return the configuration as a plain dictionary.

Return type:

dict

class evo_gafs.SelectionResult(selected_mask, selected_indices, selected_feature_names, best_fitness, best_cv_score, n_selected, compression_ratio, history=<factory>, pareto_front=None, config=None, total_time=0.0, n_evaluations=0)[source]

Bases: object

Final outcome of a GA feature-selection run.

Parameters:
selected_mask

Boolean vector of length n_features. True marks a selected feature.

Type:

numpy.ndarray

selected_indices

Indices of the selected features.

Type:

numpy.ndarray

selected_feature_names

Names of the selected features.

Type:

list of str

best_fitness

Best fitness achieved.

Type:

float

best_cv_score

Best cross-validation score (raw metric, unweighted).

Type:

float

n_selected

Number of selected features.

Type:

int

compression_ratio

Fraction of features removed (1 - n_selected / n_total).

Type:

float

history

Per-generation statistics.

Type:

list of EvolutionStats

pareto_front

Only in mode='multiobjective'. Each entry holds mask, cv_score, compression and n_features.

Type:

list of dict or None

config

Configuration used for the run.

Type:

GAConfig or None

total_time

Total wall-clock time in seconds.

Type:

float

n_evaluations

Total number of fitness evaluations performed.

Type:

int
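
Several of the attributes above are different views of the same mask. The sketch below derives them with plain Python lists, whereas the library stores numpy arrays; the feature names and the mask are made up.

```python
feature_names = ["age", "height", "weight", "income"]  # made-up names
selected_mask = [True, False, True, False]             # made-up mask

# Positions of the True entries.
selected_indices = [i for i, keep in enumerate(selected_mask) if keep]
selected_feature_names = [feature_names[i] for i in selected_indices]
n_selected = len(selected_indices)
# Fraction of features removed: 1 - n_selected / n_total.
compression_ratio = 1 - n_selected / len(selected_mask)

print(selected_indices)        # [0, 2]
print(selected_feature_names)  # ['age', 'weight']
print(n_selected)              # 2
print(compression_ratio)       # 0.5
```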

summary()[source]

Return a human-readable multi-line summary of the result.

Return type:

str

to_json()[source]

Return a JSON-serialisable dictionary (e.g. for logging/MLflow).

Return type:

dict[str, Any]

save_json(path)[source]

Write the JSON-serialisable summary to path.

Parameters:

path (str | Path)

Return type:

None

save(path)[source]

Pickle the full result object to path.

Parameters:

path (str | Path)

Return type:

None

classmethod load(path)[source]

Load a pickled SelectionResult from path.

Parameters:

path (str | Path)

Return type:

SelectionResult

class evo_gafs.EvolutionStats(generation, best_fitness, mean_fitness, std_fitness, best_n_features, mean_n_features, elapsed_time)[source]

Bases: object

Statistics for a single generation during evolution.

Evaluation

class evo_gafs.FitnessEvaluator(estimator, X, y, scoring, cv, config)[source]

Bases: object

Evaluate an individual (binary feature mask) with cross-validation.

The evaluator is instantiated once per fit and registered with DEAP as the evaluate operator. It caches results per individual to avoid re-running cross-validation for genomes that have already been seen.
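
The caching idea can be sketched with a dictionary keyed by the genome converted to a tuple. CachedEvaluator and toy_score are hypothetical: the real evaluator runs cross-validation, and its cache internals are not documented here.

```python
class CachedEvaluator:
    """Hypothetical sketch: memoise an expensive score per binary mask."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.eval_count = 0  # counts only real (non-cached) evaluations

    def __call__(self, individual):
        key = tuple(individual)  # lists are unhashable, tuples are not
        if key not in self._cache:
            self._cache[key] = self._score_fn(individual)
            self.eval_count += 1
        return self._cache[key]


def toy_score(individual):
    # Stand-in for cross-validation: fraction of active bits.
    return sum(individual) / len(individual)


evaluate = CachedEvaluator(toy_score)
evaluate([1, 0, 1, 0])
evaluate([1, 0, 1, 0])  # same genome, served from the cache
evaluate([0, 1, 1, 1])
print(evaluate.eval_count)  # 2
```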

Penalisation strategy

  • An individual with fewer than min_features active features receives a fitness of zero.

Fitness by mode

  • 'single':

    fitness = alpha * cv_score + (1 - alpha) * compression
    

    where compression = 1 - n_selected / n_total.

  • 'multiobjective': returns (cv_score, compression), both maximised (DEAP/NSGA-II handles the Pareto front).
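
In 'multiobjective' mode no single number ranks the candidates. A candidate stays on the front when no other candidate is at least as good on both objectives and strictly better on one. The sketch below shows only this dominance test on made-up (cv_score, compression) pairs; NSGA-II additionally sorts the population into ranks and uses crowding distance, which is not shown.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on one.

    Both objectives are maximised, matching (cv_score, compression).
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (cv_score, compression) pairs.
candidates = [(0.95, 0.25), (0.93, 0.75), (0.90, 0.50), (0.80, 0.90)]
print(pareto_front(candidates))
```

Here (0.90, 0.50) drops out because (0.93, 0.75) beats it on both objectives; the other three pairs are mutually non-dominated.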

Parameters:
  • estimator (sklearn estimator) – Model used as the wrapper criterion. It is cloned for each evaluation.

  • X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix.

  • y (numpy.ndarray of shape (n_samples,)) – Target vector.

  • scoring (str) – scikit-learn scoring string.

  • cv (cross-validation splitter) – Splitter used for the score.

  • config (GAConfig) – Configuration (provides mode, alpha, min_features, n_jobs).

cv_score(selected)[source]

Public helper: raw (unweighted) CV score for a feature subset.

Parameters:

selected (list[int])

Return type:

float

property eval_count: int

Number of (non-cached) fitness evaluations performed.

Benchmarking & visualization

class evo_gafs.BenchmarkRunner[source]

Bases: object

Run and compare GA feature selection across several datasets.

For each registered dataset the runner records:

  • the model’s cross-validated score using all features (baseline),

  • the score using the features selected by the GA,

  • the compression ratio, and

  • the wall-clock time.

Examples

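>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import BenchmarkRunner
>>> X, y = load_iris(return_X_y=True)
>>> runner = BenchmarkRunner()
>>> runner.add_dataset("Iris", X, y, task_type="classification")
>>> runner.run(DecisionTreeClassifier())
>>> runner.report()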

add_dataset(name, X, y, task_type='auto', description='')[source]

Register a dataset for the benchmark. Returns self for chaining.

Return type:

BenchmarkRunner

run(estimator, config=None, scoring=None, verbose=True, estimator_regression=None)[source]

Run the benchmark over all registered datasets.

Parameters:
  • estimator (sklearn estimator) – Model for classification datasets (and for all datasets when estimator_regression is not given).

  • config (GAConfig, optional) – Configuration applied to every run.

  • scoring (str, optional) – Scoring string; auto-selected per task when None.

  • verbose (bool, default=True) – Print a per-dataset progress report.

  • estimator_regression (sklearn estimator, optional) – Alternative model for regression datasets.

Returns:

One result entry per dataset.

Return type:

list of dict

report()[source]

Return (and print) a summary pandas.DataFrame of the runs.

Return type:

DataFrame

property results: list[dict]

The list of result entries from the last run().

class evo_gafs.GAPlotter[source]

Bases: object

Static helpers to visualise evolution curves and Pareto fronts.

static plot_evolution(result, figsize=(14, 5), title_prefix='')[source]

Plot fitness and feature-count evolution over generations.

static plot_pareto_front(result, figsize=(8, 6), title='Pareto front')[source]

Plot the Pareto front (multi-objective mode only).

static plot_selected_features(result, feature_names=None, figsize=(10, 6), title='Selected vs removed features')[source]

Show which features were selected versus removed.
