lncrnapy.selection
A collection of feature selection and importance analysis methods.
Feature Importance Analysis
Contains the feature_importance_analysis function, as well as several accompanying plotting functions that are tailored to its output format.
- lncrnapy.selection.importance_analysis.feature_importance_analysis(trainsets, testsets, k, tables_folder, methods=[<class 'lncrnapy.selection.methods.NoSelection'>, <class 'lncrnapy.selection.methods.TTestSelection'>, <class 'lncrnapy.selection.methods.RegressionSelection'>, <class 'lncrnapy.selection.methods.ForestSelection'>, <class 'lncrnapy.selection.methods.RFESelection'>, <class 'lncrnapy.selection.methods.MDSSelection'>], excluded_features=['id', 'label', 'sequence', 'ORF protein', 'SSE'], test=False)
Runs a feature importance analysis according to the following steps:
- For every trainset in trainsets:
  - For every method in methods:
    - Assess feature importance.
    - Select the k most important features.
    - Fit a random forest to the trainset using the selected features.
    - Evaluate the F1-score of the random forest on all testsets.
- Parameters:
trainsets (list[str]) – Names of the trainsets to be used for the importance analysis. For every name in the list, an HDF (.h5) file with feature data is assumed to be present in the directory specified by the tables_folder argument.
testsets (list[str]) – Names of the testsets to report performance on. As with trainsets, every testset is assumed to be present as an HDF (.h5) file with this name in tables_folder.
k (int) – Number of features to select.
methods (list[type]) – Types of feature selection methods to apply. Should be classes from lncrnapy.selection.
excluded_features (list[str]) – List of features to exclude from the importance analysis.
test (bool) – If True, performs analysis on 1000 random training samples.
- Returns:
`importances` (pd.DataFrame) – Reports the importance of all features for every combination of trainset and selection method.
`results` (pd.DataFrame) – Reports the performance (macro-averaged F1-score) of the selected features for every combination of trainset, selection method, and testset.
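The steps above can be sketched for a single trainset, testset, and method using scikit-learn only. This is a minimal illustration, not the lncrnapy implementation: the synthetic data, the feature names, and the use of random forest importances as the scoring method are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one trainset/testset pair (not lncrnapy data).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k = 5

# Step 1: assess feature importance (here: random forest importances).
scorer = RandomForestClassifier(n_estimators=100, random_state=0)
scorer.fit(X_train, y_train)
importances = scorer.feature_importances_

# Step 2: select the k most important features.
top_k = np.argsort(importances)[::-1][:k]
selected = [feature_names[i] for i in top_k]

# Step 3: fit a random forest using the selected features only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train[:, top_k], y_train)

# Step 4: evaluate the macro-averaged F1-score on the test set.
score = f1_score(y_test, model.predict(X_test[:, top_k]), average="macro")
print(selected, round(score, 3))
```

In the actual function, this loop would be repeated for every trainset, every method, and every testset.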
- lncrnapy.selection.importance_analysis.plot_feature_importance(importances, k=None, method=None, trainset=None, filepath=None, figsize=None)
Creates a feature importance plot, given the importances dataframe from feature_importance_analysis.
- Parameters:
importances (pd.DataFrame) – Output from feature_importance_analysis.
k (int) – Top number of features to plot, if specified (default is None).
method (str) – Method to consider, if specified (default is None). If not specified and the data contains multiple metrics, importances are converted to ranks.
trainset (str) – Trainset name to consider, if specified (default is None). If not specified, averages over all trainsets.
filepath (str) – If specified, saves figure to this filepath (default is None).
figsize (tuple[int]) – Matplotlib figure size (default is None).
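Converting importances to ranks makes metrics on different scales comparable before they are combined. A minimal pandas sketch of that idea follows. The column names (`method`, `feature`, `importance`) and all values are hypothetical and do not reflect the actual layout of the importances dataframe.

```python
import pandas as pd

# Hypothetical long-format table: two methods with very different scales.
importances = pd.DataFrame({
    "method": ["forest", "forest", "forest", "ttest", "ttest", "ttest"],
    "feature": ["a", "b", "c", "a", "b", "c"],
    "importance": [0.5, 0.3, 0.2, 12.0, 4.0, 30.0],
})

# Rank within each method (1 = most important) so scales become comparable.
importances["rank"] = (
    importances.groupby("method")["importance"].rank(ascending=False)
)

# Average the ranks over methods; a lower mean rank means more important.
mean_rank = importances.groupby("feature")["rank"].mean().sort_values()
print(mean_rank)
```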
- lncrnapy.selection.importance_analysis.plot_feature_selection_results(results, groupby, filepath=None, figsize=None)
Plots the performance of different feature selection methods, based on output from feature_importance_analysis.
- Parameters:
results (pd.DataFrame) – Results output from feature_importance_analysis.
groupby (str | list[str] | tuple[str]) – How to group the data. If of type str, will average over this column. If of type list or tuple, will apply a nested grouping where the second element refers to the inner group.
filepath (str) – If specified, saves figure to this filepath (default is None).
figsize (tuple[int]) – Matplotlib figure size (default is None).
- lncrnapy.selection.importance_analysis.sorted_feature_importance(importances, method=None, trainset=None)
Sorts feature importances, averaging over method or trainset if specified.
Feature Selection Methods
Classes for selecting features based on an importance assessment.
- class lncrnapy.selection.methods.FeatureSelectionBase(name, metric_name, k)
Base class for feature selection / importance analysis.
- `name`
Name of the applied method.
- Type:
str
- `metric_name`
Name of the metric that describes feature importance.
- Type:
str
- `k`
Number of features that will be selected.
- Type:
int
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
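The return contract of select_features (selected names plus the full importance array) can be illustrated with a small NumPy helper. `select_top_k` and the feature names are hypothetical and only mirror the described output shape. They are not part of lncrnapy.

```python
import numpy as np

def select_top_k(importances, feature_names, k):
    """Hypothetical helper: return (selected names, importance array)."""
    importances = np.asarray(importances, dtype=float)
    # Indices of the k largest importances, most important first.
    order = np.argsort(importances)[::-1][:k]
    selected = [feature_names[i] for i in order]
    return selected, importances

names = ["length", "GC content", "ORF length", "entropy"]  # illustrative only
selected, importances = select_top_k([0.1, 0.4, 0.3, 0.2], names, k=2)
print(selected)  # ['GC content', 'ORF length']
```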
- class lncrnapy.selection.methods.ForestSelection(k)
Feature selection based on the feature importance of a random forest.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
- class lncrnapy.selection.methods.MDSSelection(k, lower=0.025, upper=0.975, smoothing=35, n_bins=1000)
Method based on Minimum Distribution Similarity (mDS) as proposed by DeepCPP. Uses relative entropy (Kullback-Leibler divergence) to calculate the difference between the feature distributions of pcRNA and ncRNA, and selects the features whose distributions differ most between the two classes.
- Parameters:
k (int) – Number of features that will be selected.
lower (float) – Values below this percentile are considered outliers (default is 0.025).
upper (float) – Values above this percentile are considered outliers (default is 0.975).
smoothing (int) – Amount (sigma) of Gaussian smoothing applied to both distributions (default is 35).
n_bins (int) – Number of bins used to calculate the distribution histogram (default is 1000).
References
DeepCPP: Zhang et al. (2020) https://doi.org/10.1093/bib/bbaa039
- calculate(data, feature_name)
Calculates Minimum Distribution Similarity (mDS) for given feature_name in data.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
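How the lower, upper, smoothing, and n_bins parameters of MDSSelection could interact is sketched below with NumPy and SciPy. This is an assumption-based illustration of a smoothed-histogram Kullback-Leibler divergence, not the DeepCPP or lncrnapy code: the helper name, the treatment of lower and upper as quantile fractions, and the small constant added to avoid empty bins are all choices made for the example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import entropy

def kl_between_classes(x_pc, x_nc, lower=0.025, upper=0.975,
                       smoothing=35, n_bins=1000):
    """Hypothetical helper: KL divergence between two smoothed histograms."""
    both = np.concatenate([x_pc, x_nc])
    # Values outside the [lower, upper] quantile range are treated as outliers.
    lo, hi = np.quantile(both, [lower, upper])
    bins = np.linspace(lo, hi, n_bins + 1)
    # np.histogram ignores values that fall outside the bin range.
    p, _ = np.histogram(x_pc, bins=bins)
    q, _ = np.histogram(x_nc, bins=bins)
    # Gaussian smoothing, then a small constant to avoid empty bins.
    p = gaussian_filter1d(p.astype(float), sigma=smoothing) + 1e-12
    q = gaussian_filter1d(q.astype(float), sigma=smoothing) + 1e-12
    # scipy.stats.entropy normalizes both inputs and returns KL(p || q).
    return entropy(p, q)

rng = np.random.default_rng(0)
separated = kl_between_classes(rng.normal(0, 1, 5000), rng.normal(2, 1, 5000))
overlapping = kl_between_classes(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
print(separated > overlapping)
```

A feature whose two class distributions barely overlap yields a larger divergence than one whose distributions coincide, which is the property the selection relies on.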
- class lncrnapy.selection.methods.NoSelection(k)
Dummy class that does not select or analyze features at all.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
- class lncrnapy.selection.methods.PermutationSelection(k)
Calculates the importance of features by permuting them.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
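Permutation importance in general measures how much a fitted model's score drops when one feature's values are shuffled. A minimal sketch with scikit-learn's permutation_importance follows. The estimator, the synthetic data, and the number of repeats are assumptions, and the source does not state which model or settings PermutationSelection uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (not lncrnapy data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature several times and record the mean drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean

k = 3
top_k = np.argsort(importances)[::-1][:k]  # indices of the k largest drops
print(top_k)
```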
- class lncrnapy.selection.methods.RFESelection(k, step=0.01)
Recursive Feature Elimination. Uses ranks as importance measure.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
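Recursive feature elimination with ranks as the importance measure can be sketched with scikit-learn's RFE. The wrapped estimator (logistic regression) and the synthetic data are assumptions, because the source does not state which estimator RFESelection wraps. In scikit-learn, a step between 0 and 1 is the fraction of features removed per iteration, and selected features receive rank 1.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (not lncrnapy data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
k = 5

# Repeatedly fit the estimator and drop the weakest features until k remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k, step=0.01)
rfe.fit(X, y)

ranking = rfe.ranking_  # 1 = selected; higher values were eliminated earlier
selected = [i for i, rank in enumerate(ranking) if rank == 1]
print(selected)
```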
- class lncrnapy.selection.methods.RegressionSelection(k)
Feature selection based on the size of regression coefficients.
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
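Using coefficient size as importance can be sketched as follows. The choice of logistic regression and the standardization step are assumptions: the source does not say which regression model RegressionSelection fits, and coefficients are only comparable across features when the features share a scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not lncrnapy data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Importance = absolute value of each feature's coefficient.
importances = np.abs(model.coef_[0])
k = 3
top_k = np.argsort(importances)[::-1][:k]
print(top_k)
```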
- class lncrnapy.selection.methods.TTestSelection(k, alpha=0.05)
Feature selection based on an association test statistic (t-test).
- select_features(data, feature_names)
Selects features by assessing their importance for a given Data object. Returns a tuple in which the first element is a list of selected feature names, and the second element is the feature importance array.
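A t-test based importance can be sketched with scipy.stats.ttest_ind, using the absolute test statistic per feature. The synthetic data is an assumption, and so is the use of alpha: the source does not describe how TTestSelection applies it, so here it only flags features whose p-value falls below the threshold.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 200

# Synthetic two-class data: only the first feature differs between classes.
X_pc = rng.normal(0.0, 1.0, size=(n, 3))
X_nc = rng.normal(0.0, 1.0, size=(n, 3))
X_nc[:, 0] += 1.5

# Independent two-sample t-test per feature (column-wise).
stat, pvalue = ttest_ind(X_pc, X_nc, axis=0)
importances = np.abs(stat)

alpha = 0.05  # assumed role: significance threshold
significant = pvalue < alpha
best = int(np.argmax(importances))
print(best, significant[0])
```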