lncrnapy.data

Contains the Data class for storing, analyzing, and manipulating RNA sequence data.

class lncrnapy.data.Data(fasta_filepath=None, hdf_filepath=None, csv_filepath=None)

Container for RNA sequence data. Contains methods for data analysis and manipulation.

`df`

The underlying DataFrame object containing the data.

Type:

pd.DataFrame

`labelled`

Whether the data has labels or not.

Type:

bool

`X_name`

List of predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.

Type:

list[str]

`X_dtype`

Data type for X.

Type:

type

`y_name`

List of target feature names (columns) to be retrieved as tensors when __getitem__ is called.

Type:

list[str]

`y_dtype`

Data type for y.

Type:

type

add_feature(feature_data, feature_names)

Safely adds feature_data as new columns to Data.

all_features(except_columns=['id', 'sequence', 'label'])

Returns a list of all features present in data.

calculate_feature(feature_extractor)

Extracts feature(s) from Data using feature_extractor.

check_columns(columns, behaviour='error')

Checks whether every column in columns appears in this Data object. Depending on behaviour, a missing column either raises an error or makes the method return False.

Parameters:
  • columns (list[str]) – List of column names whose presence in the Data object is checked.

  • behaviour (['error'|'bool']) – Whether to raise an error in case of a missing column or whether to return False or True (default is ‘error’).
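
The two behaviours can be illustrated with a minimal sketch. The function check_columns_sketch below is hypothetical and works on a plain list of column names rather than a Data object; the error type and the return value in 'error' mode are assumptions, since the documentation above does not specify them.

```python
def check_columns_sketch(available, columns, behaviour='error'):
    """Hypothetical stand-in for Data.check_columns, using a plain list."""
    missing = [c for c in columns if c not in available]
    if behaviour == 'bool':
        # Report the outcome instead of raising.
        return len(missing) == 0
    if missing:
        # Assumption: the real method may raise a different error type.
        raise KeyError(f"Missing columns: {missing}")
    # Assumption: the return value in 'error' mode is not documented.
    return True


available = ['id', 'sequence', 'label', 'length']
print(check_columns_sketch(available, ['length'], behaviour='bool'))
print(check_columns_sketch(available, ['GC content'], behaviour='bool'))
```

The 'bool' mode is convenient for conditional logic (for example, calculating a feature only when it is absent), while 'error' mode fails fast.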

coding_noncoding_split()

Returns two Data objects, the first containing all pcRNA, the second containing all ncRNA.

feature_correlation(feature_names)

Calculates the correlation between features in feature_names.

filter_outliers(feature_name, tolerance)

Removes data entries for which the value of feature_name falls outside of the tolerated range.

Parameters:
  • feature_name (str) – Name of the feature that should be considered.

  • tolerance (float | int | list | range) – When numeric, refers to the tolerated number of standard deviations from the mean. When a list or range, refers to the exact tolerated lower (inclusive) and upper (exclusive) bound.
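
The two interpretations of tolerance can be sketched in plain Python. The helper within_tolerance is hypothetical and is not the library's implementation. It assumes a sample standard deviation, inclusive bounds in the numeric case, that a list holds the bounds as its first two elements, and that a range contributes its start and stop values.

```python
import statistics


def within_tolerance(values, tolerance):
    """Return one keep/drop flag per value (hypothetical helper)."""
    if isinstance(tolerance, (int, float)):
        # Numeric: tolerated number of standard deviations from the mean.
        mean = statistics.mean(values)
        std = statistics.stdev(values)  # assumption: sample std
        lower, upper = mean - tolerance * std, mean + tolerance * std
        return [lower <= v <= upper for v in values]
    if isinstance(tolerance, range):
        lower, upper = tolerance.start, tolerance.stop
    else:
        lower, upper = tolerance[0], tolerance[1]
    # Explicit bounds: lower inclusive, upper exclusive.
    return [lower <= v < upper for v in values]


lengths = [10, 11, 9, 10, 12, 8, 10, 50]
print(within_tolerance(lengths, 2))            # only the value 50 is dropped
print(within_tolerance([100, 999, 1000], range(100, 1000)))
```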

filter_sequence_quality(tolerance)

Removes data entries for which the fraction of uncertain (non-ACGT) bases exceeds tolerance.
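
As a rough illustration of this criterion, the sketch below computes the fraction of non-ACGT characters per sequence and keeps only sequences at or below the tolerance. The helpers uncertain_fraction and filter_quality are hypothetical names, and the case-insensitive comparison is an assumption.

```python
def uncertain_fraction(sequence):
    """Fraction of bases that are not A, C, G, or T (hypothetical helper)."""
    if not sequence:
        return 0.0
    # Assumption: lowercase bases count as certain too.
    uncertain = sum(1 for base in sequence.upper() if base not in 'ACGT')
    return uncertain / len(sequence)


def filter_quality(sequences, tolerance):
    """Keep sequences whose uncertain fraction does not exceed tolerance."""
    return [s for s in sequences if uncertain_fraction(s) <= tolerance]


sequences = ['ACGTACGT', 'ACGTNNAC']
print(uncertain_fraction('ACGTNNAC'))   # 0.25
print(filter_quality(sequences, 0.1))   # ['ACGTACGT']
```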

num_coding_noncoding()

Returns a tuple containing the number of coding and the number of non-coding sequences in the Data object, respectively.

plot_feature_boxplot(feature_name, filepath=None, figsize=None, **kwargs)

Returns a boxplot of the feature specified by feature_name, saving the plot to filepath if provided.

plot_feature_correlation(feature_names, filepath=None, figsize=None)

Plots a heatmap of absolute correlation values for the features in feature_names.

plot_feature_density(feature_name, filepath=None, lower=0.025, upper=0.975, figsize=None, **kwargs)

Returns a density plot of the feature specified by feature_name, saving the plot to filepath if provided.

Parameters:
  • feature_name (str) – Name of the feature to plot.

  • filepath (str) – If specified, will save plot to this path.

  • lower (float) – Lower limit of density plot, indicated as quantile (default is 0.025).

  • upper (float) – Upper limit of density plot, indicated as quantile (default is 0.975).

  • figsize (tuple[int]) – Matplotlib figure size (default is None).

  • kwargs – Any keyword argument from pd.DataFrame.plot.density.

plot_feature_scatter(x_feature_name, y_feature_name, c_feature_name=None, c_lower=0.025, c_upper=0.975, filepath=None, axis_labels=True, xlim=None, ylim=None, figsize=None)

Returns a scatter plot with x_feature_name on the x-axis plotted against y_feature_name on the y-axis.

plot_feature_space(feature_names, dim_red=TSNE(), filepath=None, figsize=None)

Returns a visualization of the feature space of the data, reducing the dimensionality to 2.

plot_feature_violin(feature_name, filepath=None, figsize=None, **kwargs)

Returns a violin plot of the feature specified by feature_name, saving the plot to filepath if provided.

pos_weight()

Ratio of non-coding/coding samples, used as weight for positive class in weighted loss calculation.
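
The ratio can be sketched as follows. The helper pos_weight_sketch is hypothetical and takes a plain list of labels, assuming the encoding used elsewhere in this module (1 for pcRNA, 0 for ncRNA).

```python
def pos_weight_sketch(labels):
    """Ratio of non-coding to coding samples (hypothetical helper).

    Assumes labels use 1 for pcRNA and 0 for ncRNA.
    """
    num_coding = sum(1 for label in labels if label == 1)
    num_noncoding = len(labels) - num_coding
    return num_noncoding / num_coding


labels = [1, 1, 0, 0, 0, 0, 0, 0]
print(pos_weight_sketch(labels))  # 3.0
```

A value like this can, for instance, be passed as the pos_weight argument of torch.nn.BCEWithLogitsLoss, so that errors on the rarer positive class weigh more heavily in the loss.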

sample(pc=None, nc=None, N=None, replace=False, random_state=None)

Returns a randomly sampled Data object.

Parameters:
  • pc (int | float | None) – Number/fraction of protein-coding sequences in resulting dataset.

  • nc (int | float | None) – Number/fraction of non-coding sequences in resulting dataset.

  • N (int | None) – Total number of sequences in resulting dataset. When specified together with pc and nc, will consider the latter two as fractions.

  • replace (bool) – Whether or not to sample with replacement. Required if N or pc+nc exceeds the number of samples in the Data object.

  • random_state (int) – Seed for random number generator.
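
The interplay between N, pc, nc, and replace can be illustrated with a standard-library sketch. The function sample_sketch is hypothetical and operates on two plain lists. It covers only the case where N is given together with pc and nc, and it assumes that the two fractions are normalised by their sum, which the documentation above does not state.

```python
import random


def sample_sketch(coding, noncoding, pc, nc, N, replace=False,
                  random_state=None):
    """Draw N items split between two classes (hypothetical helper)."""
    rng = random.Random(random_state)
    # Assumption: pc and nc act as relative fractions of N.
    n_pc = round(N * pc / (pc + nc))
    n_nc = N - n_pc

    def draw(items, k):
        if replace:
            return rng.choices(items, k=k)
        # Without replacement, k may not exceed the number of items.
        return rng.sample(items, k)

    return draw(coding, n_pc), draw(noncoding, n_nc)


coding = [f'pc{i}' for i in range(10)]
noncoding = [f'nc{i}' for i in range(10)]

pc_part, nc_part = sample_sketch(coding, noncoding, pc=1, nc=3, N=8,
                                 random_state=0)
print(len(pc_part), len(nc_part))  # 2 6
```

Requesting more items than a class contains only works in this sketch when replace=True, which mirrors the requirement described for the replace parameter.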

set_random_reading_frame(rrf)

Samples data in a random reading frame by deleting a random number (within the range [0, `rrf`]) of nucleotides from the start of each sequence. Only works with 4D-DNA encoding.
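
The trimming step can be sketched on a plain string. The helper random_reading_frame is hypothetical, works on raw sequences rather than the 4D-DNA encoding, and assumes both ends of the range [0, rrf] are inclusive.

```python
import random


def random_reading_frame(sequence, rrf, rng=random):
    """Drop between 0 and rrf nucleotides from the start (hypothetical)."""
    offset = rng.randint(0, rrf)  # randint includes both endpoints
    return sequence[offset:]


rng = random.Random(42)  # seeded generator for reproducibility
sequence = 'ATGGCCTAA'
shifted = random_reading_frame(sequence, 2, rng)
print(shifted)
```

With rrf=2 the result starts at one of the three possible reading frames, so a model trained on such data cannot rely on the sequence always beginning in frame.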

set_tensor_features(X_name, X_dtype=torch.float32, y_name=None, y_dtype=torch.float32, len_4d_dna=7670)

Configures Data object to return a tuple of tensors (X,y) whenever __getitem__ is called.

Parameters:
  • X_name (list[str] | str) – Predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.

  • X_dtype (type) – Data type for X (default is torch.float32)

  • y_name (list[str] | str) – Target feature names (columns) to be retrieved as tensors when __getitem__ is called. If None (default) and the data is labelled, will set ‘label’ as target feature, with 1 indicating pcRNA and 0 lncRNA.

  • y_dtype (type) – Data type for y (default is torch.float32)

  • len_4d_dna (int) – Maximum length of the returned sequence when X_name == '4D-DNA' (default is 7670)

test_features(feature_names)

Evaluates statistical significance of features specified in feature_names using a t-test.

to_csv(path_or_buf, except_columns=['sequence'], **kwargs)

Writes data to a .csv file.

Parameters:
  • path_or_buf – Target file or buffer.

  • except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])

  • kwargs – Any keyword argument accepted by pd.DataFrame.to_csv

to_fasta(fasta_filepath)

Writes sequence data to the FASTA file(s) specified by fasta_filepath, which can be a string or a list of two strings indicating the filepaths for coding and non-coding transcripts, respectively.

to_hdf(path_or_buf, except_columns=['sequence'], **kwargs)

Writes data to a .h5 file.

Parameters:
  • path_or_buf – Target file or buffer.

  • except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])

  • kwargs – Any keyword argument accepted by pd.DataFrame.to_hdf

train_test_split(test_size, **kwargs)

Splits the data into train and test datasets, as specified by test_size. Accepts all keyword arguments from sklearn.model_selection.train_test_split.

lncrnapy.data.get_gencode_gene_names(data)

Returns list of GENCODE gene names extracted from “id” column.

lncrnapy.data.get_rna_type_refseq(fasta_header)

Extracts the RNA type from an input FASTA header line.

lncrnapy.data.merge_fasta(in_filepaths, out_filepath)

Merges FASTA files in in_filepaths into a single out_filepath.
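
Conceptually, merging amounts to concatenating the input files in order. The sketch below uses only the standard library; merge_fasta_sketch is a hypothetical name, and adding a newline after inputs that lack a trailing one is an assumption made so that records from different files never share a line.

```python
import os
import tempfile


def merge_fasta_sketch(in_filepaths, out_filepath):
    """Concatenate FASTA files into one output file (hypothetical helper)."""
    with open(out_filepath, 'w') as out_file:
        for path in in_filepaths:
            with open(path) as in_file:
                content = in_file.read()
            out_file.write(content)
            # Keep the next header on its own line.
            if content and not content.endswith('\n'):
                out_file.write('\n')


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, 'a.fasta')
    second = os.path.join(tmp, 'b.fasta')
    out = os.path.join(tmp, 'merged.fasta')
    with open(first, 'w') as f:
        f.write('>seq1\nACGT')       # no trailing newline
    with open(second, 'w') as f:
        f.write('>seq2\nGGCC\n')
    merge_fasta_sketch([first, second], out)
    with open(out) as f:
        merged = f.read()

print(merged)
```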

lncrnapy.data.plot_cross_dataset_violins(data_objects, data_names, feature_name, filepath=None, upper=0.975, lower=0.025, figsize=None, **kwargs)

Creates violin plots for multiple data_objects for a given feature_name, allowing datasets to be compared.

lncrnapy.data.plot_refseq_labels(fasta_filepath, filepath=None, figsize=None)

Plots the distribution of RNA labels of a FASTA file that follows the RefSeq format, optionally saving the figure to filepath.

lncrnapy.data.reduce_dimensionality(data, dim_red=TSNE())

Reduces the dimensionality of data using dim_red.

lncrnapy.data.split_refseq(in_filepath, pc_filepath, nc_filepath, pc_types=['mRNA'], nc_types=['long non-coding RNA'])

Splits RefSeq FASTA file into two files, coding and noncoding.
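
The routing decision behind such a split can be sketched as follows. Both helpers are hypothetical, and the sketch assumes that the RNA type is the text after the last comma of the header line; the example headers are made up for illustration and do not refer to real accessions.

```python
def rna_type_from_header(header):
    """Return the text after the last comma (assumed header layout)."""
    return header.strip().rsplit(',', 1)[-1].strip()


def classify_header(header, pc_types=('mRNA',),
                    nc_types=('long non-coding RNA',)):
    """Return 'coding', 'noncoding', or None for other RNA types."""
    rna_type = rna_type_from_header(header)
    if rna_type in pc_types:
        return 'coding'
    if rna_type in nc_types:
        return 'noncoding'
    return None  # neither list matches, so the record is skipped


# Illustrative headers only, not real database entries.
headers = [
    '>NM_000000.1 Homo sapiens example gene (EXMPL1), mRNA',
    '>NR_000000.1 Homo sapiens example gene (EXMPL2), long non-coding RNA',
    '>NR_000001.1 Homo sapiens example gene (EXMPL3), misc_RNA',
]
for header in headers:
    print(rna_type_from_header(header), '->', classify_header(header))
```

Records whose type appears in neither pc_types nor nc_types are left out of both files in this sketch; whether the library treats them the same way is not stated above.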