API reference
The public API of evo_gafs.
- GAFeatureSelector – Genetic-algorithm wrapper feature selector, compatible with scikit-learn.
- GAConfig – Full configuration of the genetic algorithm.
- SelectionResult – Final outcome of a GA feature-selection run.
- Statistics for a single generation during evolution.
- FitnessEvaluator – Evaluate an individual (binary feature mask) with cross-validation.
- BenchmarkRunner – Run and compare GA feature selection across several datasets.
- GAPlotter – Static helpers to visualise evolution curves and Pareto fronts.
Selector
- class evo_gafs.GAFeatureSelector(estimator, config=None, scoring=None, task_type='auto', feature_names=None)
Bases: SelectorMixin, BaseEstimator
Genetic-algorithm wrapper feature selector, compatible with scikit-learn.
The selector searches for the subset of features that maximises a cross-validated score of estimator (the wrapper criterion), optionally trading raw performance for a smaller feature set.
- Parameters:
estimator (sklearn estimator) – Model used to score candidate feature subsets. Must implement fit and predict. Fast estimators (decision trees, linear models) keep the search affordable. It is cloned for every evaluation, never fitted in place.
config (GAConfig, optional) – Genetic-algorithm configuration. If None, defaults are used.
scoring (str, optional) – scikit-learn scoring string (e.g. 'accuracy', 'f1_macro', 'r2', 'neg_mean_squared_error'). If None, it is chosen from task_type.
task_type ({'auto', 'classification', 'regression'}, default='auto') – Problem type. 'auto' infers it from y.
feature_names (list of str, optional) – Names of the input features. Inferred from a DataFrame's columns or generated as f0, f1, ... when not given.
- support_
Boolean mask of selected features.
- Type:
Examples
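>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import GAFeatureSelector, GAConfig
>>> # Illustrative data only; any (X, y) train/test split works here.
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> config = GAConfig(population_size=20, n_generations=10, verbose=False)
>>> selector = GAFeatureSelector(
...     estimator=DecisionTreeClassifier(random_state=42),
...     config=config,
... )
>>> selector.fit(X_train, y_train)
>>> X_selected = selector.transform(X_test)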
In addition to the methods documented here, GAFeatureSelector inherits the standard scikit-learn transformer API from sklearn.feature_selection.SelectorMixin and sklearn.base.BaseEstimator, notably transform, fit_transform, get_support, get_params and set_params.
- fit(X, y, callbacks=None)
Run the genetic algorithm to find the best feature subset.
- Parameters:
X (array-like or pandas.DataFrame of shape (n_samples, n_features)) – Training data. Sparse matrices are not supported.
y (array-like of shape (n_samples,)) – Target values. Integers/strings for classification, floats for regression.
callbacks (list of callable, optional) – Functions f(gen, stats, population) -> bool. Returning True stops evolution early.
- Returns:
self – The fitted selector.
- Return type:
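The callbacks contract described for fit (a callable f(gen, stats, population) that returns True to stop) can be sketched as follows. This is a minimal illustration, not part of evo_gafs: the helper names make_max_generation_callback and make_time_budget_callback are hypothetical, and gen is assumed to be an integer generation counter. The stats and population arguments are accepted but not inspected, since their attributes are not documented here.

```python
import time


def make_max_generation_callback(last_generation):
    """Return a callback that asks to stop once `gen` reaches `last_generation`.

    Assumption: `gen` is an integer generation counter.
    """
    def callback(gen, stats, population):
        return gen >= last_generation

    return callback


def make_time_budget_callback(max_seconds):
    """Return a callback that asks to stop once a wall-clock budget is spent."""
    start = time.monotonic()

    def callback(gen, stats, population):
        # Only elapsed time is checked; the three arguments are accepted to
        # match the documented signature but are not used.
        return time.monotonic() - start >= max_seconds

    return callback
```

Under these assumptions, such callbacks would be passed as selector.fit(X, y, callbacks=[make_time_budget_callback(60.0)]).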
- set_fit_request(*, callbacks='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
self (GAFeatureSelector)
- Returns:
self – The updated object.
- Return type:
Configuration & results
- class evo_gafs.GAConfig(population_size=50, n_generations=100, crossover_prob=0.8, mutation_prob=0.15, mutation_indpb=None, tournament_size=3, mode='single', alpha=0.8, cv_folds=5, min_features=1, elite_size=2, random_seed=42, n_jobs=1, verbose=True, early_stopping_rounds=None, early_stopping_tol=0.0001)
Bases: object
Full configuration of the genetic algorithm.
- Parameters:
population_size (int, default=50) – Number of individuals in the population. Larger populations explore the search space better at a higher computational cost. Typical: 30-100.
n_generations (int, default=100) – Number of generations (iterations of the GA). Typical: 50-200.
crossover_prob (float, default=0.8) – Probability of applying crossover between two individuals. Recommended range: [0.6, 0.9].
mutation_prob (float, default=0.15) – Probability of applying the mutation operator to an individual. Recommended range: [0.05, 0.3].
mutation_indpb (float or None, default=None) – Independent probability of flipping each bit when an individual is mutated. If None, it is set to 1 / n_features at fit time.
tournament_size (int, default=3) – Tournament size for tournament selection (mode='single'). Larger values increase selective pressure. Typical: 2-7.
mode ({'single', 'multiobjective'}, default='single') – 'single' uses a weighted scalar fitness; 'multiobjective' uses NSGA-II and returns a Pareto front.
alpha (float, default=0.8) – Weight of the performance metric in mode='single':
fitness = alpha * cv_score + (1 - alpha) * compression_ratio
alpha=1.0 is a pure wrapper; lower values favour compression (useful for edge deployment).
cv_folds (int, default=5) – Number of cross-validation folds used to evaluate fitness.
min_features (int, default=1) – Minimum number of selected features. Individuals below this threshold are repaired/penalised.
elite_size (int, default=2) – Number of best individuals carried over unchanged each generation (elitism, mode='single' only).
random_seed (int or None, default=42) – Seed for reproducibility.
n_jobs (int, default=1) – Parallelism passed to scikit-learn's cross-validation.
verbose (bool, default=True) – If True, print a log line for some generations and a final summary.
early_stopping_rounds (int or None, default=None) – If the best fitness does not improve for this many generations, stop. None disables early stopping (mode='single' only).
early_stopping_tol (float, default=1e-4) – Minimum improvement considered significant for early stopping.
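How early_stopping_rounds and early_stopping_tol interact can be illustrated with a small stand-alone sketch. The function stopping_generation below is hypothetical and is not the library's implementation; it assumes that an improvement counts only when it exceeds the tolerance, and that the stale-generation counter resets on every such improvement.

```python
def stopping_generation(best_fitness_per_gen, rounds, tol=1e-4):
    """Return the 0-based generation index at which evolution would stop.

    Returns None if early stopping is disabled (rounds is None) or is never
    triggered. Assumption: only gains strictly larger than `tol` count.
    """
    if rounds is None:
        return None  # early stopping disabled
    best_so_far = float("-inf")
    stale = 0  # consecutive generations without a significant gain
    for gen, fitness in enumerate(best_fitness_per_gen):
        if fitness - best_so_far > tol:
            best_so_far = fitness
            stale = 0
        else:
            stale += 1
            if stale >= rounds:
                return gen
    return None


history = [0.70, 0.75, 0.75005, 0.75, 0.75, 0.80]
# The gain at index 2 is below tol, so indices 2, 3 and 4 are all stale.
print(stopping_generation(history, rounds=3))  # -> 4
```

With rounds=3 the run never reaches the later jump to 0.80, which shows the usual trade-off: a small rounds value saves evaluations but can cut off late improvements.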
- validate()
Validate the configuration, raising ValueError on error.
Unlike assert statements, these checks are always enforced, even when Python runs with optimisations (-O).
- Return type:
None
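The difference between explicit ValueError checks and assert statements can be shown on a reduced stand-in. MiniConfig below is a hypothetical three-field class, not GAConfig itself, and the specific bounds it checks are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MiniConfig:
    """Hypothetical stand-in with a few GAConfig-like fields."""

    population_size: int = 50
    crossover_prob: float = 0.8
    mode: str = "single"

    def validate(self) -> None:
        # Explicit raises still run under `python -O`, whereas `assert`
        # statements are stripped in that mode.
        if self.population_size < 2:  # bound is an assumption
            raise ValueError("population_size must be at least 2")
        if not 0.0 <= self.crossover_prob <= 1.0:
            raise ValueError("crossover_prob must lie in [0, 1]")
        if self.mode not in ("single", "multiobjective"):
            raise ValueError("mode must be 'single' or 'multiobjective'")


MiniConfig().validate()  # defaults pass silently
try:
    MiniConfig(mode="pareto").validate()
except ValueError as exc:
    print(f"rejected: {exc}")
```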
- class evo_gafs.SelectionResult(selected_mask, selected_indices, selected_feature_names, best_fitness, best_cv_score, n_selected, compression_ratio, history=<factory>, pareto_front=None, config=None, total_time=0.0, n_evaluations=0)
Bases: object
Final outcome of a GA feature-selection run.
- Parameters:
- selected_mask
Boolean vector of length n_features. True marks a selected feature.
- Type:
- selected_indices
Indices of the selected features.
- Type:
- history
Per-generation statistics.
- Type:
- pareto_front
Only in mode='multiobjective'. Each entry holds mask, cv_score, compression and n_features.
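The relationship between the mask-derived fields of SelectionResult can be sketched in plain Python. The helper summarise_mask is hypothetical; it assumes the compression ratio follows the definition given for FitnessEvaluator (1 - n_selected / n_total) and reuses the documented f0, f1, ... fallback for feature names.

```python
def summarise_mask(selected_mask, feature_names=None):
    """Derive indices, names, count and compression ratio from a boolean mask."""
    n_total = len(selected_mask)
    if feature_names is None:
        # Same fallback naming scheme as documented for GAFeatureSelector.
        feature_names = [f"f{i}" for i in range(n_total)]
    selected_indices = [i for i, keep in enumerate(selected_mask) if keep]
    n_selected = len(selected_indices)
    return {
        "selected_indices": selected_indices,
        "selected_feature_names": [feature_names[i] for i in selected_indices],
        "n_selected": n_selected,
        # Assumed definition: fraction of features that were dropped.
        "compression_ratio": 1 - n_selected / n_total,
    }


summary = summarise_mask([True, False, True, False])
print(summary["selected_feature_names"])  # -> ['f0', 'f2']
print(summary["compression_ratio"])       # -> 0.5
```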
- classmethod load(path)
Load a pickled SelectionResult from path.
- Parameters:
- Return type:
Evaluation
- class evo_gafs.FitnessEvaluator(estimator, X, y, scoring, cv, config)
Bases: object
Evaluate an individual (binary feature mask) with cross-validation.
The evaluator is instantiated once per fit and registered with DEAP as the evaluate operator. It caches results per individual to avoid re-running cross-validation for genomes that have already been seen.
Penalisation strategy
Fewer than min_features active features -> fitness of zero.
Fitness by mode
'single':
fitness = alpha * cv_score + (1 - alpha) * compression
where compression = 1 - n_selected / n_total.
'multiobjective': returns (cv_score, compression), both maximised (DEAP/NSGA-II handles the Pareto front).
- Parameters:
estimator (sklearn estimator) – Model used as the wrapper criterion. It is cloned for each evaluation.
X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix.
y (numpy.ndarray of shape (n_samples,)) – Target vector.
scoring (str) – scikit-learn scoring string.
cv (cross-validation splitter) – Splitter used for the score.
config (GAConfig) – Configuration (provides mode, alpha, min_features, n_jobs).
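The penalisation, per-mode fitness and caching rules described for FitnessEvaluator can be sketched without scikit-learn or DEAP. In this illustration make_fitness is a hypothetical helper and score_fn stands in for the cross-validated score of a mask. Returning tuples follows DEAP's convention for fitness values, and using (0.0, 0.0) as the multi-objective penalty is an assumption.

```python
def make_fitness(score_fn, alpha=0.8, min_features=1, mode="single"):
    """Build a cached fitness function over binary feature masks."""
    cache = {}

    def fitness(mask):
        key = tuple(bool(bit) for bit in mask)
        if key in cache:  # genome already seen: skip re-scoring
            return cache[key]
        n_selected, n_total = sum(key), len(key)
        if n_selected < min_features:
            # Penalty: too few active features gives zero fitness.
            result = (0.0,) if mode == "single" else (0.0, 0.0)
        else:
            cv_score = score_fn(key)
            compression = 1 - n_selected / n_total
            if mode == "single":
                result = (alpha * cv_score + (1 - alpha) * compression,)
            else:
                result = (cv_score, compression)
        cache[key] = result
        return result

    return fitness


# Stand-in scorer that always reports 0.9; a real one would cross-validate.
fit = make_fitness(lambda mask: 0.9, alpha=0.8)
print(fit([1, 0, 1, 0]))  # 0.8 * 0.9 + 0.2 * 0.5, approximately (0.82,)
print(fit([0, 0, 0, 0]))  # -> (0.0,)
```

Keying the cache on the mask tuple means a genome reached again through crossover or mutation costs a dictionary lookup rather than another round of cross-validation.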
Benchmarking & visualization
- class evo_gafs.BenchmarkRunner
Bases: object
Run and compare GA feature selection across several datasets.
For each registered dataset the runner records:
the model’s cross-validated score using all features (baseline),
the score using the features selected by the GA,
the compression ratio, and
the wall-clock time.
Examples
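>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from evo_gafs import BenchmarkRunner
>>> X, y = load_iris(return_X_y=True)
>>> runner = BenchmarkRunner()
>>> runner.add_dataset("Iris", X, y, task_type="classification")
>>> runner.run(DecisionTreeClassifier())
>>> runner.report()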
- add_dataset(name, X, y, task_type='auto', description='')
Register a dataset for the benchmark. Returns self for chaining.
- run(estimator, config=None, scoring=None, verbose=True, estimator_regression=None)
Run the benchmark over all registered datasets.
- Parameters:
estimator (sklearn estimator) – Model for classification datasets (and for all datasets when estimator_regression is not given).
config (GAConfig, optional) – Configuration applied to every run.
scoring (str, optional) – Scoring string; auto-selected per task when None.
verbose (bool, default=True) – Print a per-dataset progress report.
estimator_regression (sklearn estimator, optional) – Alternative model for regression datasets.
- Returns:
One result entry per dataset.
- Return type:
- report()
Return (and print) a summary pandas.DataFrame of the runs.
- Return type:
- class evo_gafs.GAPlotter
Bases: object
Static helpers to visualise evolution curves and Pareto fronts.
- static plot_evolution(result, figsize=(14, 5), title_prefix='')
Plot fitness and feature-count evolution over generations.
- Parameters:
result (SelectionResult)
title_prefix (str)
- static plot_pareto_front(result, figsize=(8, 6), title='Pareto front')
Plot the Pareto front (multi-objective mode only).
- Parameters:
result (SelectionResult)
title (str)
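For readers unfamiliar with what a Pareto front contains, the following stand-alone sketch filters a list of (cv_score, compression) pairs, both maximised, down to its non-dominated points. It is a plain quadratic filter written for illustration, not NSGA-II and not the code used by evo_gafs; the function name pareto_front and the sample points are made up.

```python
def pareto_front(points):
    """Return the non-dominated points, with every objective maximised."""

    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and strictly
        # better in at least one objective.
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = [(0.90, 0.2), (0.85, 0.5), (0.80, 0.4), (0.70, 0.8), (0.90, 0.1)]
print(pareto_front(candidates))
# -> [(0.9, 0.2), (0.85, 0.5), (0.7, 0.8)]
```

Each surviving point is a different trade-off between score and feature-set size, which is the shape of data that a Pareto-front plot visualises.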