lncrnapy.data

Contains the Data class for storing, analyzing, and manipulating RNA sequence data.

class lncrnapy.data.Data(fasta_filepath=None, hdf_filepath=None, csv_filepath=None)

Container for RNA sequence data. Contains methods for data analysis and manipulation.

`df`

The underlying DataFrame object containing the data.

Type:: pd.DataFrame

`labelled`

Whether the data has labels or not.

Type:: bool

`X_name`

List of predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.

Type:: list[str]

`X_dtype`

Data type for X.

Type:: type

`y_name`

List of target feature names (columns) to be retrieved as tensors when __getitem__ is called.

Type:: list[str]

`y_dtype`

Data type for y.

Type:: type

add_feature(feature_data, feature_names): Safely adds feature_data as new columns to Data.

all_features(except_columns=['id', 'sequence', 'label']): Returns a list of all features present in data.

calculate_feature(feature_extractor): Extracts feature(s) from Data using feature_extractor.

check_columns(columns, behaviour='error')

Raises an error when a column from columns does not appear in this Data object.

Parameters:

columns (list[str]) – List of column names that are checked for presence in the Data object.
behaviour (['error'|'bool']) – Whether to raise an error in case of a missing column or whether to return False or True (default is ‘error’).
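The two behaviour modes can be illustrated with a standalone sketch. The helper below is hypothetical: it works on a plain list of column names instead of a Data object, and the exception type and the return value in 'error' mode are assumptions, not lncrnapy's implementation.

```python
def check_columns_sketch(available, columns, behaviour='error'):
    """Hypothetical stand-in for Data.check_columns on a plain list of names."""
    missing = [c for c in columns if c not in available]
    if behaviour == 'bool':
        # Report the outcome instead of raising.
        return not missing
    if missing:
        # KeyError is an assumption; the real exception type is not documented here.
        raise KeyError(f"Missing columns: {missing}")
    return True  # assumption: nothing to report when all columns are present


available = ['id', 'sequence', 'label', 'length']
print(check_columns_sketch(available, ['length'], behaviour='bool'))
print(check_columns_sketch(available, ['GC content'], behaviour='bool'))
```

The 'bool' mode is convenient for conditional logic (for example, computing a feature only if it is not present yet), while 'error' fails fast.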

coding_noncoding_split(): Returns two Data objects, the first containing all pcRNA, the second containing all ncRNA.

feature_correlation(feature_names): Calculates the correlation between features in feature_names.

filter_outliers(feature_name, tolerance)

Removes data entries for which the value of feature_name falls outside of the tolerated range.

Parameters:

feature_name (str) – Name of the feature that should be considered.
tolerance (float | int | list | range) – When numeric, refers to the tolerated amount of standard deviation from the mean. When a list or range, refers to the exact tolerated lower (inclusive) and upper (exclusive) bound.
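The two interpretations of tolerance can be sketched on a plain list of values. This is a minimal illustration, not the library's code: the helper name is hypothetical, and the use of the sample standard deviation, the inclusive comparison in the numeric case, and the way bounds are read from a list are assumptions.

```python
import statistics


def filter_outliers_sketch(values, tolerance):
    """Keep only the values inside the tolerated range (hypothetical helper)."""
    if isinstance(tolerance, (int, float)):
        # Numeric tolerance: allowed number of standard deviations from the mean.
        mean = statistics.mean(values)
        std = statistics.stdev(values)  # sample standard deviation (assumption)
        return [v for v in values if abs(v - mean) <= tolerance * std]
    # List or range: explicit bounds, lower inclusive and upper exclusive.
    if isinstance(tolerance, range):
        lower, upper = tolerance.start, tolerance.stop
    else:
        lower, upper = tolerance[0], tolerance[-1]
    return [v for v in values if lower <= v < upper]


lengths = [10, 12, 11, 13, 12, 11, 100]
print(filter_outliers_sketch(lengths, 2))              # 100 lies more than 2 std from the mean
print(filter_outliers_sketch(lengths, [10, 13]))       # 13 is excluded (upper bound is exclusive)
print(filter_outliers_sketch(lengths, range(10, 13)))  # same bounds expressed as a range
```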

filter_sequence_quality(tolerance): Removes data entries for which the percentage of uncertain bases (non-ACGT) exceeds a tolerance fraction.
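As a rough sketch of the criterion described above, the fraction of uncertain bases can be computed per sequence and compared against the tolerance. The helper name is hypothetical, and the case-insensitive comparison is an assumption.

```python
def uncertain_fraction(sequence):
    """Fraction of characters in sequence that are not A, C, G or T."""
    if not sequence:
        return 0.0
    uncertain = sum(1 for base in sequence.upper() if base not in 'ACGT')
    return uncertain / len(sequence)


sequences = ['ACGTACGT', 'ACGTNNNN', 'ACNTACGTAC']
tolerance = 0.2
# Entries are removed only when the fraction exceeds the tolerance.
kept = [s for s in sequences if uncertain_fraction(s) <= tolerance]
print(kept)
```

Here the second sequence (half of its bases are N) is dropped, while the third (one uncertain base in ten) is kept.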

num_coding_noncoding(): Returns a tuple containing the number of coding and non-coding sequences in the Data object, respectively.

plot_feature_boxplot(feature_name, filepath=None, figsize=None, **kwargs): Returns a boxplot of the feature specified by feature_name, saving the plot to filepath if provided.

plot_feature_correlation(feature_names, filepath=None, figsize=None): Plots heatmap of absolute correlation values.

plot_feature_density(feature_name, filepath=None, lower=0.025, upper=0.975, figsize=None, **kwargs)

Returns a density plot of the feature specified by feature_name, saving the plot to filepath if provided.

Parameters:

feature_name (str) – Name of the to-be-plotted feature.
filepath (str) – If specified, will save plot to this path.
lower (float) – Lower limit of density plot, indicated as quantile (default is 0.025).
upper (float) – Upper limit of density plot, indicated as quantile (default is 0.975).
figsize (tuple[int]) – Matplotlib figure size (default is None).
kwargs – Any keyword argument from pd.DataFrame.plot.density.

plot_feature_scatter(x_feature_name, y_feature_name, c_feature_name=None, c_lower=0.025, c_upper=0.975, filepath=None, axis_labels=True, xlim=None, ylim=None, figsize=None): Returns a scatter plot with x_feature_name on the x-axis plotted against y_feature_name on the y-axis.

plot_feature_space(feature_names, dim_red=TSNE(), filepath=None, figsize=None): Returns a visualization of the feature space of the data, reducing the dimensionality to 2.

plot_feature_violin(feature_name, filepath=None, figsize=None, **kwargs): Returns a violin plot of the feature specified by feature_name, saving the plot to filepath if provided.

pos_weight(): Ratio of non-coding/coding samples, used as weight for positive class in weighted loss calculation.

sample(pc=None, nc=None, N=None, replace=False, random_state=None)

Returns a randomly sampled Data object.

Parameters:

pc (int | float | None) – Number/fraction of protein-coding sequences in resulting dataset.
nc (int | float | None) – Number/fraction of non-coding sequences in resulting dataset.
N (int | None) – Total number of sequences in resulting dataset. When specified together with pc and nc, will consider the latter two as fractions.
replace (bool) – Whether or not to sample with replacement. Required if N or pc+nc exceeds the number of samples in the Data object.
random_state (int) – Seed for random number generator.
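Why replace becomes mandatory for oversampling can be seen in a standard-library sketch. The helper below is hypothetical and samples from a plain list of identifiers; it only illustrates the role of replace and random_state and is not how Data.sample is implemented.

```python
import random


def sample_ids(ids, n, replace=False, random_state=None):
    """Draw n identifiers from ids, with or without replacement (hypothetical helper)."""
    rng = random.Random(random_state)
    if replace:
        # With replacement, n may exceed the number of available items.
        return rng.choices(ids, k=n)
    if n > len(ids):
        raise ValueError("Cannot draw more items than available without replacement")
    return rng.sample(ids, n)


ids = ['t1', 't2', 't3']
oversampled = sample_ids(ids, 5, replace=True, random_state=0)
print(oversampled)
```

Passing the same random_state twice yields the same draw, which is what makes a sampled dataset reproducible.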

set_random_reading_frame(rrf): Samples data in a random reading frame by deleting a random number (within the range [0, `rrf`]) of nucleotides from the start of a sequence. Only works with 4D-DNA encoding.
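The idea can be sketched on a plain string. Note that the real method is documented to work only with the 4D-DNA encoding; the string-based helper below is hypothetical and only illustrates the offset sampling.

```python
import random


def random_reading_frame(sequence, rrf, rng):
    """Drop a random number of leading nucleotides, between 0 and rrf (inclusive)."""
    offset = rng.randint(0, rrf)  # randint includes both endpoints
    return sequence[offset:]


rng = random.Random(42)
sequence = 'ATGGCCATTGTAATGGGCCGC'
shifted = [random_reading_frame(sequence, 2, rng) for _ in range(5)]
for s in shifted:
    print(s)
```

With rrf=2, every returned sequence is the original with zero, one, or two nucleotides removed from its start, so each of the three reading frames can occur.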

set_tensor_features(X_name, X_dtype=torch.float32, y_name=None, y_dtype=torch.float32, len_4d_dna=7670)

Configures Data object to return a tuple of tensors (X,y) whenever __getitem__ is called.

Parameters:

X_name (list[str] | str) – Predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.
X_dtype (type) – Data type for X (default is torch.float32)
y_name (list[str] | str) – Target feature names (columns) to be retrieved as tensors when __getitem__ is called. If None (default) and the data is labelled, will set ‘label’ as target feature, with 1 indicating pcRNA and 0 lncRNA.
y_dtype (type) – Data type for y (default is torch.float32)
len_4d_dna (int) – Maximum length of the returned sequence when X_name == '4D-DNA' (default is 7670)

test_features(feature_names): Evaluates statistical significance of features specified in feature_names using a t-test.

to_csv(path_or_buf, except_columns=['sequence'], **kwargs)

Write data to .csv file.

Parameters:

path_or_buf – Target file or buffer.
except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])
kwargs – Any keyword argument accepted by pd.DataFrame.to_csv

to_fasta(fasta_filepath): Writes sequence data to FASTA file(s) specified by fasta_filepath, which can be a string or a list of strings (length 2) indicating the filepaths for coding and non-coding transcripts, respectively.

to_hdf(path_or_buf, except_columns=['sequence'], **kwargs)

Write data to .h5 file.

Parameters:

path_or_buf – Target file or buffer.
except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])
kwargs – Any keyword argument accepted by pd.DataFrame.to_hdf

train_test_split(test_size, **kwargs): Splits the data into train and test datasets, as specified by test_size. Accepts all keyword arguments from sklearn.model_selection.train_test_split.

lncrnapy.data.get_gencode_gene_names(data): Returns list of GENCODE gene names extracted from “id” column.

lncrnapy.data.get_rna_type_refseq(fasta_header): Extracts the RNA type from an input FASTA header line.

lncrnapy.data.merge_fasta(in_filepaths, out_filepath): Merges FASTA files in in_filepaths into a single out_filepath.
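Merging FASTA files amounts to concatenating them in order. The sketch below uses only the standard library; the function name is hypothetical, and adding a newline between files that lack a trailing one is an assumption about desirable behaviour, not a documented feature of merge_fasta.

```python
import os
import tempfile


def merge_fasta_sketch(in_filepaths, out_filepath):
    """Concatenate the FASTA files in in_filepaths into out_filepath."""
    with open(out_filepath, 'w') as out_file:
        for path in in_filepaths:
            with open(path) as in_file:
                content = in_file.read()
            out_file.write(content)
            # Make sure the next file's header starts on a new line.
            if content and not content.endswith('\n'):
                out_file.write('\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    first = os.path.join(tmp_dir, 'first.fasta')
    second = os.path.join(tmp_dir, 'second.fasta')
    merged = os.path.join(tmp_dir, 'merged.fasta')
    with open(first, 'w') as f:
        f.write('>seq1\nACGT')      # no trailing newline on purpose
    with open(second, 'w') as f:
        f.write('>seq2\nTTGA\n')
    merge_fasta_sketch([first, second], merged)
    with open(merged) as f:
        merged_text = f.read()

print(merged_text)
```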

lncrnapy.data.plot_cross_dataset_violins(data_objects, data_names, feature_name, filepath=None, upper=0.975, lower=0.025, figsize=None, **kwargs): Creates violin plots for multiple data_objects for a given feature_name, allowing datasets to be compared.

lncrnapy.data.plot_refseq_labels(fasta_filepath, filepath=None, figsize=None): Plots the distribution of RNA labels of a FASTA file that follows the RefSeq format, optionally saving the figure to filepath.

lncrnapy.data.reduce_dimensionality(data, dim_red=TSNE()): Reduces the dimensionality of data using dim_red.

lncrnapy.data.split_refseq(in_filepath, pc_filepath, nc_filepath, pc_types=['mRNA'], nc_types=['long non-coding RNA']): Splits RefSeq FASTA file into two files, coding and noncoding.
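How a header might be routed to the coding or non-coding file can be sketched as follows. This rests on an assumption: that the RNA type is the text after the last comma of the header line, as in typical RefSeq transcript headers. The helper names and the example headers are made up for illustration, and dropping sequences whose type matches neither list is likewise an assumption.

```python
def rna_type_from_header(fasta_header):
    """Return the text after the last comma of a header (assumed to be the RNA type)."""
    return fasta_header.rsplit(',', 1)[-1].strip()


def target_for_header(fasta_header, pc_types=('mRNA',),
                      nc_types=('long non-coding RNA',)):
    """Return 'pc', 'nc', or None depending on the RNA type in the header."""
    rna_type = rna_type_from_header(fasta_header)
    if rna_type in pc_types:
        return 'pc'
    if rna_type in nc_types:
        return 'nc'
    return None  # assumption: other RNA types end up in neither file


# Illustrative headers, not real RefSeq records.
headers = [
    '>NM_0000000.0 Example species example gene (EXG1), transcript variant 1, mRNA',
    '>NR_0000000.0 Example species example transcript (EXT1), long non-coding RNA',
    '>NR_0000001.0 Example species example small RNA (EXS1), small nuclear RNA',
]
targets = [target_for_header(h) for h in headers]
print(targets)
```

Widening pc_types or nc_types changes which transcripts are kept, mirroring the role of the corresponding arguments of split_refseq.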