lncrnapy.selection

A collection of feature selection and importance analysis methods.

Feature Importance Analysis

Contains the feature_importance_analysis function, as well as several accompanying plotting functions tailored to its output format.

lncrnapy.selection.importance_analysis.feature_importance_analysis(trainsets, testsets, k, tables_folder, methods=[<class 'lncrnapy.selection.methods.NoSelection'>, <class 'lncrnapy.selection.methods.TTestSelection'>, <class 'lncrnapy.selection.methods.RegressionSelection'>, <class 'lncrnapy.selection.methods.ForestSelection'>, <class 'lncrnapy.selection.methods.RFESelection'>, <class 'lncrnapy.selection.methods.MDSSelection'>], excluded_features=['id', 'label', 'sequence', 'ORF protein', 'SSE'], test=False)

Runs a feature importance analysis according to the following steps:

• For every trainset in trainsets:

  • For every method in methods:
    • Assess feature importance.

    • Select k most important features.

    • Fit a random forest to the trainset using the selected features.

    • Evaluate the F1-score of the random forest on all testsets.
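The nested loop described by these steps can be sketched in plain Python. This is only a control-flow illustration, not the library's implementation: run_analysis and all of its arguments are hypothetical stand-ins, the "model" is a placeholder rather than a random forest, and the score is a dummy value rather than an F1-score.

```python
def run_analysis(trainsets, testsets, k, methods, fit, evaluate):
    """Control-flow sketch only; every argument is a hypothetical stand-in.

    trainsets, testsets: dict mapping a dataset name to its data
    methods: dict mapping a method name to a function data -> {feature: importance}
    fit: function (data, selected_features) -> model
    evaluate: function (model, data, selected_features) -> score
    """
    importances, results = [], []
    for train_name, train in trainsets.items():
        for method_name, assess in methods.items():
            # Assess feature importance on this trainset.
            scores = assess(train)
            # Select the k most important features.
            selected = sorted(scores, key=scores.get, reverse=True)[:k]
            # Fit a model on the trainset using only the selected features.
            model = fit(train, selected)
            for feature, value in scores.items():
                importances.append((train_name, method_name, feature, value))
            # Evaluate the fitted model on every testset.
            for test_name, test in testsets.items():
                score = evaluate(model, test, selected)
                results.append((train_name, method_name, test_name, score))
    return importances, results


# Placeholder inputs: no real data, model, or metric is involved.
toy_scores = {"feature_a": 0.9, "feature_b": 0.1, "feature_c": 0.5}
importances, results = run_analysis(
    trainsets={"train1": None, "train2": None},
    testsets={"test1": None},
    k=2,
    methods={"dummy": lambda data: toy_scores},
    fit=lambda data, selected: list(selected),
    evaluate=lambda model, data, selected: 0.0,
)
```

The sketch yields one importance row per (trainset, method, feature) and one result row per (trainset, method, testset), which mirrors the two outputs described for feature_importance_analysis.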

Parameters:
  • trainsets (list[str]) – Names of the trainsets to use for the importance analysis. For every name in the list, an HDF (.h5) file with feature data is assumed to be present in the directory specified by the tables_folder argument.

  • testsets (list[str]) – Names of the testsets to report performance on. As with trainsets, every testset is assumed to be present as an HDF (.h5) file with this name in tables_folder.

  • k (int) – Number of features to select.

  • methods (list[type]) – Feature selection method classes to apply. Should be classes from lncrnapy.selection.methods.

  • excluded_features (list[str]) – List of features to exclude from the importance analysis.

  • test (bool) – If True, performs analysis on 1000 random training samples.

Returns:

  • `importances` (pd.DataFrame) – Reports the importance of all features for every combination of trainset and selection method.

  • `results` (pd.DataFrame) – Reports the performance (macro-averaged F1-score) for every combination of trainset, selection method, and testset.
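For reference, a macro-averaged F1-score is the unweighted mean of the per-class F1-scores. The following minimal sketch computes it in plain Python; the function names and the toy labels are illustrative assumptions, and the library may well rely on an existing implementation instead.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels, made up for illustration.
y_true = ["pcRNA", "pcRNA", "ncRNA", "ncRNA"]
y_pred = ["pcRNA", "ncRNA", "ncRNA", "ncRNA"]
score = macro_f1(y_true, y_pred)  # mean of 2/3 (pcRNA) and 4/5 (ncRNA)
```

Because every class contributes equally, the macro average does not favor the larger class when the testset is imbalanced.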

lncrnapy.selection.importance_analysis.plot_feature_importance(importances, k=None, method=None, trainset=None, filepath=None, figsize=None)

Creates a feature importance plot, given the importances dataframe from feature_importance_analysis.

Parameters:
  • importances (pd.DataFrame) – Output from feature_importance_analysis.

  • k (int) – Number of top-ranked features to plot, if specified (default is None).

  • method (str) – Method to consider, if specified (default is None). If not specified and the data contains multiple metrics, importances are converted to ranks.

  • trainset (str) – Trainset name to consider, if specified (default is None). If not specified, averages over all trainsets.

  • filepath (str) – If specified, saves figure to this filepath (default is None).

  • figsize (tuple[int]) – Matplotlib figure size (default is None).
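Converting importances to ranks makes metrics with different scales comparable. The sketch below shows one way to do this, assuming rank 1 goes to the highest importance; the to_ranks helper, the metric names, the feature names, and the numbers are all made up for illustration, and the library's own ranking and tie handling may differ.

```python
def to_ranks(scores):
    """Map each feature to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


# Hypothetical importances on two very different scales.
importances = {
    "t-statistic": {"ORF length": 41.2, "GC content": 3.5, "length": 12.0},
    "forest importance": {"ORF length": 0.30, "GC content": 0.09, "length": 0.61},
}

# Rank within each metric, then average the ranks per feature.
ranks = {metric: to_ranks(scores) for metric, scores in importances.items()}
features = list(importances["t-statistic"])
mean_rank = {f: sum(r[f] for r in ranks.values()) / len(ranks) for f in features}
```

Averaging the raw values instead would let the metric with the largest scale dominate, which is why a rank transform is a common choice here.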

lncrnapy.selection.importance_analysis.plot_feature_selection_results(results, groupby, filepath=None, figsize=None)

Plots the performance of different feature selection methods, based on output from feature_importance_analysis.

Parameters:
  • results (pd.DataFrame) – Results output from feature_importance_analysis.

  • groupby (str | list[str] | tuple[str]) – How to group the data. If of type str, will average over this column. If of type list or tuple, will apply a nested grouping where the second element refers to the inner group.

  • filepath (str) – If specified, saves figure to this filepath (default is None).

  • figsize (tuple[int]) – Matplotlib figure size (default is None).

lncrnapy.selection.importance_analysis.sorted_feature_importance(importances, method=None, trainset=None)

Sorts feature importances, averaging over method or trainset if specified.

Feature Selection Methods

Classes for selecting features based on an importance assessment.

class lncrnapy.selection.methods.FeatureSelectionBase(name, metric_name, k)

Base class for feature selection / importance analysis.

`name`

Name of the applied method.

Type: str

`metric_name`

Name of the metric that describes feature importance.

Type: str

`k`

Number of features that will be selected.

Type: int

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.
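The return contract of select_features can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's code: TopKSelection and VarianceSelection are hypothetical, the data is a plain dict of columns rather than a Data object, and variance is used only as an example importance metric.

```python
import statistics


class TopKSelection:
    """Hypothetical stand-in for a selection base class (not lncrnapy's)."""

    def __init__(self, name, metric_name, k):
        self.name = name                # name of the applied method
        self.metric_name = metric_name  # name of the importance metric
        self.k = k                      # number of features to select

    def score(self, data, feature_names):
        raise NotImplementedError

    def select_features(self, data, feature_names):
        """Return (selected feature names, importance of every feature)."""
        importances = self.score(data, feature_names)
        order = sorted(
            range(len(feature_names)), key=lambda i: importances[i], reverse=True
        )
        selected = [feature_names[i] for i in order[: self.k]]
        return selected, importances


class VarianceSelection(TopKSelection):
    """Example subclass: uses the population variance as importance."""

    def __init__(self, k):
        super().__init__("Variance", "variance", k)

    def score(self, data, feature_names):
        return [statistics.pvariance(data[name]) for name in feature_names]


# Toy data: a dict of columns stands in for the Data object.
data = {"a": [1.0, 1.0, 1.0, 1.0], "b": [0.0, 10.0, 0.0, 10.0], "c": [1.0, 2.0, 3.0, 4.0]}
selected, importances = VarianceSelection(k=2).select_features(data, ["a", "b", "c"])
```

Note that the importance array covers all candidate features, in the order of feature_names, while the list of selected names contains only the top k.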

class lncrnapy.selection.methods.ForestSelection(k)

Feature selection based on the feature importance of a random forest.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.

class lncrnapy.selection.methods.MDSSelection(k, lower=0.025, upper=0.975, smoothing=35, n_bins=1000)

Method based on Minimum Distribution Similarity (mDS) as proposed by DeepCPP. Uses relative entropy (Kullback-Leibler divergence) to calculate the difference between the feature distributions of pcRNA and ncRNA, and selects the features whose distributions differ most.

Parameters:
  • k (int) – Number of features that will be selected.

  • lower (float) – Values below this percentile are considered outliers (default is 0.025).

  • upper (float) – Values above this percentile are considered outliers (default is 0.975).

  • smoothing (int) – Amount (sigma) of Gaussian smoothing applied to both distributions (default is 35).

  • n_bins (int) – Number of bins used to calculate the distribution histogram (default is 1000).

References

DeepCPP: Zhang et al. (2020) https://doi.org/10.1093/bib/bbaa039

calculate(data, feature_name)

Calculates Minimum Distribution Similarity (mDS) for given feature_name in data.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.
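The core idea behind MDSSelection, comparing two histograms of one feature with the Kullback-Leibler divergence, can be sketched as follows. This is a simplified illustration and not the mDS procedure itself: the outlier bounds (lower, upper) and the Gaussian smoothing are omitted, a small constant eps is added to every bin instead so that empty bins do not cause division by zero, and all function names are hypothetical.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # v == hi goes in the last bin
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Relative entropy D(P || Q) between two binned distributions."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def distribution_difference(pc_values, nc_values, n_bins=20):
    """Divergence between one feature's values in two classes."""
    lo = min(min(pc_values), min(nc_values))
    hi = max(max(pc_values), max(nc_values))
    if hi == lo:
        return 0.0  # constant feature: the distributions are identical
    p = histogram(pc_values, lo, hi, n_bins)
    q = histogram(nc_values, lo, hi, n_bins)
    return kl_divergence(p, q)


# Toy feature values, made up for illustration.
pc = [i / 10 for i in range(100)]          # 0.0 .. 9.9
nc_shifted = [5 + i / 10 for i in range(100)]  # 5.0 .. 14.9
same = distribution_difference(pc, pc)
different = distribution_difference(pc, nc_shifted)
```

A feature whose two class distributions overlap completely scores zero, while a shifted distribution scores higher, so ranking features by this value favors those that separate the classes.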

class lncrnapy.selection.methods.NoSelection(k)

Dummy class that does not select or analyze features at all.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.

class lncrnapy.selection.methods.PermutationSelection(k)

Calculates feature importance by permuting the values of each feature.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.
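Permutation importance in general measures how much a model's score drops when the values of one feature are shuffled. The sketch below illustrates the technique with a hand-written toy model; which model, score, and number of repeats PermutationSelection uses is not stated here, so every name and value in the example is an assumption.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def model(row):
    """Toy model that only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0


# Toy data: two features per row, labels generated by the toy model itself.
rows = [(i / 20, ((i * 7) % 20) / 20) for i in range(20)]
labels = [model(r) for r in rows]
importances = permutation_importance(model, rows, labels, n_features=2)
```

Shuffling feature 1 cannot change any prediction because the toy model ignores it, so its importance is exactly zero, whereas shuffling feature 0 lowers the accuracy.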

class lncrnapy.selection.methods.RFESelection(k, step=0.01)

Recursive Feature Elimination. Uses ranks as importance measure.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.

class lncrnapy.selection.methods.RegressionSelection(k)

Feature selection based on the size of regression coefficients.

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.

class lncrnapy.selection.methods.TTestSelection(k, alpha=0.05)

Feature selection based on an association test statistic (t-test).

select_features(data, feature_names)

Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of the selected feature names and the second element is the feature importance array.
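Ranking features by a t-test statistic can be sketched as follows. The example uses Welch's t-statistic (unequal variances) as an assumption; whether TTestSelection uses this variant, and how it applies alpha, is not specified here. The feature names and values are made up for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Toy feature values per class (hypothetical feature names).
pc = {"ORF length": [300.0, 320.0, 310.0, 330.0], "GC content": [0.50, 0.52, 0.49, 0.51]}
nc = {"ORF length": [60.0, 80.0, 70.0, 90.0], "GC content": [0.49, 0.51, 0.50, 0.52]}

# Use the absolute statistic as importance: the sign only encodes direction.
t_values = {name: abs(welch_t(pc[name], nc[name])) for name in pc}
ranked = sorted(t_values, key=t_values.get, reverse=True)
```

A feature whose class means lie far apart relative to its spread receives a large statistic and ranks first, while a feature with identical class means scores zero.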