lncrnapy.selection

A collection of feature selection and importance analysis methods.

Feature Importance Analysis

Contains the feature_importance_analysis function, as well as several accompanying plotting functions tailored to its output format.

lncrnapy.selection.importance_analysis.feature_importance_analysis(trainsets, testsets, k, tables_folder, methods=[<class 'lncrnapy.selection.methods.NoSelection'>, <class 'lncrnapy.selection.methods.TTestSelection'>, <class 'lncrnapy.selection.methods.RegressionSelection'>, <class 'lncrnapy.selection.methods.ForestSelection'>, <class 'lncrnapy.selection.methods.RFESelection'>, <class 'lncrnapy.selection.methods.MDSSelection'>], excluded_features=['id', 'label', 'sequence', 'ORF protein', 'SSE'], test=False)

Runs a feature importance analysis according to the following steps:

- For every trainset in trainsets:
  - For every method in methods:
    - Assess feature importance.
    - Select the k most important features.
    - Fit a random forest to the trainset using the selected features.
    - Evaluate the F1-score of the random forest on all testsets.
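The steps above can be sketched with scikit-learn on synthetic data. This is only an illustration of the loop, not the lncrnapy implementation: the helper names `select_all` and `select_by_forest`, the synthetic dataset, and the forest settings are all assumptions, and the real function reads its data from .h5 files rather than arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one trainset and one testset (not lncrnapy data).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
k = 5


def select_all(X, y, k):
    # Hypothetical baseline: keep every feature (no selection).
    return np.arange(X.shape[1])


def select_by_forest(X, y, k):
    # Hypothetical selector: rank features by impurity-based forest importance.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:k]


results = {}
for name, select in [("none", select_all), ("forest", select_by_forest)]:
    # Assess importance and select features on the training data only.
    selected = select(X_train, y_train, k)
    # Fit a random forest using only the selected columns.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, selected], y_train)
    # Evaluate the macro-averaged F1-score on the held-out set.
    predictions = model.predict(X_test[:, selected])
    results[name] = f1_score(y_test, predictions, average="macro")

print(results)
```

Selecting features on the training data and scoring on separate test data keeps the selection step from leaking information into the reported F1-score.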

Parameters:

trainsets (list[str]) – Names of the trainsets to use for the importance analysis. For every name in the list, an HDF (.h5) file with feature data is assumed to be present in the directory specified by the tables_folder argument.
testsets (list[str]) – Names of the testsets to report performance on. As with trainsets, an HDF (.h5) file with this name is assumed to be present in tables_folder for every testset.
k (int) – Number of features to select.
methods (list[type]) – Feature selection method classes to apply. Should be classes from lncrnapy.selection.methods.
excluded_features (list[str]) – List of features to exclude from the importance analysis.
test (bool) – If True, performs analysis on 1000 random training samples.

Returns:

`importances` (pd.DataFrame) – Reports the importance of all features for every combination of trainset and selection method.
`results` (pd.DataFrame) – Reports the performance (macro-averaged F1-score) for every combination of trainset, selection method, and testset.

lncrnapy.selection.importance_analysis.plot_feature_importance(importances, k=None, method=None, trainset=None, filepath=None, figsize=None)

Creates a feature importance plot, given the importances dataframe from feature_importance_analysis.

Parameters:

importances (pd.DataFrame) – Output from feature_importance_analysis.
k (int) – Number of top-ranked features to plot, if specified (default is None).
method (str) – Method to consider, if specified (default is None). If not specified and the data contains multiple metrics, importances are converted to ranks.
trainset (str) – Trainset name to consider, if specified (default is None). If not specified, averages over all trainsets.
filepath (str) – If specified, saves figure to this filepath (default is None).
figsize (tuple[int]) – Matplotlib figure size (default is None).
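Converting importances to ranks matters because different methods score features on different scales. The pandas sketch below shows one way this could work. The long-format table, its column names, and the feature names are made up for illustration, and the actual layout of the importances dataframe may differ.

```python
import pandas as pd

# Hypothetical long-format importances; the real column layout may differ.
importances = pd.DataFrame({
    "method": ["forest", "forest", "forest", "ttest", "ttest", "ttest"],
    "feature": ["length", "GC", "ORF length"] * 2,
    "importance": [0.50, 0.10, 0.40, 3.0, 12.0, 25.0],
})

# Rank within each method (1 = most important), so that scores measured
# on different scales become comparable across methods.
importances["rank"] = (
    importances.groupby("method")["importance"].rank(ascending=False)
)

# Average the ranks per feature; a lower mean rank means more important.
mean_rank = importances.groupby("feature")["rank"].mean().sort_values()
print(mean_rank)
```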

lncrnapy.selection.importance_analysis.plot_feature_selection_results(results, groupby, filepath=None, figsize=None)

Plots the performance of different feature selection methods, based on output from feature_importance_analysis.

Parameters:

results (pd.DataFrame) – Results output from feature_importance_analysis.
groupby (str | list[str] | tuple[str]) – How to group the data. If of type str, will average over this column. If of type list or tuple, will apply a nested grouping where the second element refers to the inner group.
filepath (str) – If specified, saves figure to this filepath (default is None).
figsize (tuple[int]) – Matplotlib figure size (default is None).
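The two forms of groupby can be illustrated with plain pandas. This sketch assumes a results table with columns named `trainset`, `method`, `testset`, and `F1`; these names and values are hypothetical. It also reads "average over this column" as computing one mean per value of that column, which is one interpretation of the description above.

```python
import pandas as pd

# Hypothetical results table; column names and scores are assumptions.
results = pd.DataFrame({
    "trainset": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "method": ["forest", "forest", "ttest", "ttest"] * 2,
    "testset": ["X", "Y"] * 4,
    "F1": [0.90, 0.80, 0.70, 0.60, 0.85, 0.75, 0.65, 0.55],
})

# groupby given as a str: one mean F1 per value of that column.
by_method = results.groupby("method")["F1"].mean()

# groupby given as a list: nested grouping, with the second element
# ("testset") as the inner group, shown here as table columns.
nested = results.groupby(["method", "testset"])["F1"].mean().unstack()

print(by_method)
print(nested)
```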

lncrnapy.selection.importance_analysis.sorted_feature_importance(importances, method=None, trainset=None): Sorts feature importances, averaging over method or trainset if specified.

Feature Selection Methods

Classes for selecting features based on an importance assessment.

class lncrnapy.selection.methods.FeatureSelectionBase(name, metric_name, k)

Base class for feature selection / importance analysis.

`name`

Name of the applied method.

Type: str

`metric_name`

Name of the metric that describes feature importance.

Type: str

`k`

Number of features that will be selected.

Type: int

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
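The return contract of select_features can be illustrated with a stand-alone class. `TopKSelector` is hypothetical and does not inherit from the lncrnapy base class. It operates on a NumPy array instead of a Data object and uses column variance purely as a placeholder importance measure.

```python
import numpy as np


class TopKSelector:
    """Hypothetical stand-in that mirrors the name / metric_name / k attributes."""

    def __init__(self, name, metric_name, k):
        self.name = name
        self.metric_name = metric_name
        self.k = k

    def importance(self, X):
        # Placeholder metric: the variance of each feature column.
        return X.var(axis=0)

    def select_features(self, X, feature_names):
        scores = self.importance(X)
        # Indices of the k highest-scoring features, best first.
        top = np.argsort(scores)[::-1][:self.k]
        # Return (selected feature names, importance of every feature).
        return [feature_names[i] for i in top], scores


X = np.array([[1.0, 0.0, 5.0],
              [1.0, 2.0, -5.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, -5.0]])
selector = TopKSelector("variance", "variance", k=2)
selected, scores = selector.select_features(X, ["a", "b", "c"])
print(selected, scores)
```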

class lncrnapy.selection.methods.ForestSelection(k)

Feature selection based on the feature importance of a random forest.

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.

class lncrnapy.selection.methods.MDSSelection(k, lower=0.025, upper=0.975, smoothing=35, n_bins=1000)

Method based on Minimum Distribution Similarity (mDS) as proposed by DeepCPP. Uses relative entropy (Kullback-Leibler divergence) to measure the difference between the distributions of a feature in pcRNA and in ncRNA, and selects the features whose distributions differ most.

Parameters:

k (int) – Number of features that will be selected.
lower (float) – Values below this percentile are considered outliers (default is 0.025).
upper (float) – Values above this percentile are considered outliers (default is 0.975).
smoothing (int) – Amount (sigma) of Gaussian smoothing applied to both distributions (default is 35).
n_bins (int) – Number of bins used to calculate the distribution histogram (default is 1000).

References

DeepCPP: Zhang et al. (2020) https://doi.org/10.1093/bib/bbaa039

calculate(data, feature_name): Calculates Minimum Distribution Similarity (mDS) for given feature_name in data.
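A rough sketch of the idea behind the mDS score, using NumPy and SciPy on synthetic values. The function name `kl_feature_divergence` is hypothetical. The use of the combined quantiles as the histogram range, the small constant added to avoid empty bins, and the direction of the divergence (pcRNA relative to ncRNA) are all assumptions rather than details confirmed by the library or by DeepCPP.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def kl_feature_divergence(pc_values, nc_values, lower=0.025, upper=0.975,
                          smoothing=35, n_bins=1000):
    values = np.concatenate([pc_values, nc_values])
    # Restrict the histogram range to the chosen quantiles, so that
    # values outside this range (treated as outliers) are ignored.
    low, high = np.quantile(values, [lower, upper])
    bins = np.linspace(low, high, n_bins + 1)
    p, _ = np.histogram(pc_values, bins=bins)
    q, _ = np.histogram(nc_values, bins=bins)
    # Gaussian smoothing, a small constant against empty bins,
    # then normalisation to probability distributions.
    eps = 1e-10
    p = gaussian_filter1d(p.astype(float), smoothing) + eps
    q = gaussian_filter1d(q.astype(float), smoothing) + eps
    p, q = p / p.sum(), q / q.sum()
    # Relative entropy (Kullback-Leibler divergence) of p with respect to q.
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
# A feature whose class distributions are shifted apart ...
separated = kl_feature_divergence(rng.normal(0, 1, 5000),
                                  rng.normal(2, 1, 5000))
# ... versus a feature with the same distribution in both classes.
overlapping = kl_feature_divergence(rng.normal(0, 1, 5000),
                                    rng.normal(0, 1, 5000))
print(separated, overlapping)
```

A feature with well-separated class distributions yields a larger divergence than one whose distributions overlap, which is what makes the score usable as an importance measure.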

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.

class lncrnapy.selection.methods.NoSelection(k)

Dummy class that does not select or analyze features at all.

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.

class lncrnapy.selection.methods.PermutationSelection(k)

Calculates feature importance by performing permutations on them.

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
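Permutation importance in general can be sketched with scikit-learn's `permutation_importance`. Which model and scoring PermutationSelection uses internally is not stated here, so the random forest, the synthetic data, and the number of repeats below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data; with shuffle=False the informative features come first.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one column at a time and measure the resulting drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

k = 3
top_k = np.argsort(result.importances_mean)[::-1][:k]
print(top_k, result.importances_mean)
```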

class lncrnapy.selection.methods.RFESelection(k, step=0.01)

Recursive Feature Elimination. Uses ranks as importance measure.

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
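Recursive feature elimination with ranks as the importance measure can be sketched with scikit-learn's `RFE`. The logistic regression estimator and the synthetic data are assumptions, since the estimator used by RFESelection is not stated here. In scikit-learn, a step value between 0 and 1 is the fraction of features removed per iteration, with at least one feature removed each time.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
k = 5

# Repeatedly fit the estimator and drop the weakest features until k remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k, step=0.01)
rfe.fit(X, y)

# ranking_ is 1 for the selected features; larger ranks were eliminated earlier.
print(rfe.ranking_)
print(rfe.support_)
```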

class lncrnapy.selection.methods.RegressionSelection(k)

Feature selection based on the size of regression coefficients.

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
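Ranking features by coefficient size can be sketched as follows. The choice of logistic regression and the standardisation step are assumptions, as the regression model used by RegressionSelection is not specified here. Scaling is included because coefficient magnitudes are only comparable when features share a scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
k = 3

# Standardise so that coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Importance = absolute value of each coefficient (binary case: one row).
importance = np.abs(model.coef_[0])
top_k = np.argsort(importance)[::-1][:k]
print(top_k, importance)
```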

class lncrnapy.selection.methods.TTestSelection(k, alpha=0.05)

Feature selection based on an association test statistic (t-test).

select_features(data, feature_names): Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
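A t-test based ranking can be sketched with `scipy.stats.ttest_ind` on synthetic data. How TTestSelection uses alpha is not described here. The sketch assumes that the absolute t-statistic serves as the importance and that alpha is a significance threshold on the p-value.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 200
labels = np.array([0] * n + [1] * n)

# Three synthetic features; only the first differs between the two classes.
X = rng.normal(size=(2 * n, 3))
X[labels == 1, 0] += 1.0

# Two-sample t-test per feature (column) between the classes.
t_stat, p_value = ttest_ind(X[labels == 0], X[labels == 1], axis=0)

k, alpha = 2, 0.05
importance = np.abs(t_stat)
top_k = np.argsort(importance)[::-1][:k]
significant = p_value < alpha  # assumed role of alpha
print(top_k, significant)
```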