lncrnapy.data
Contains the Data class for storing, analyzing, and manipulating RNA sequence data.
- class lncrnapy.data.Data(fasta_filepath=None, hdf_filepath=None, csv_filepath=None)
Container for RNA sequence data. Contains methods for data analysis and manipulation.
- `df`
The underlying DataFrame object containing the data.
- Type:
pd.DataFrame
- `labelled`
Whether the data has labels or not.
- Type:
bool
- `X_name`
List of predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.
- Type:
list[str]
- `X_dtype`
Data type for X.
- Type:
type
- `y_name`
List of target feature names (columns) to be retrieved as tensors when __getitem__ is called.
- Type:
list[str]
- `y_dtype`
Data type for y.
- Type:
type
- add_feature(feature_data, feature_names)
Safely adds feature_data as new columns to Data.
- all_features(except_columns=['id', 'sequence', 'label'])
Returns a list of all features present in data.
- calculate_feature(feature_extractor)
Extracts feature(s) from Data using feature_extractor.
- check_columns(columns, behaviour='error')
Raises an error when a column from columns does not appear in this Data object.
- Parameters:
columns (list[str]) – List of column names whose presence in the Data object is checked.
behaviour (['error'|'bool']) – If ‘error’ (default), raises an error when a column is missing; if ‘bool’, returns False when a column is missing and True otherwise.
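The two behaviour modes can be sketched in plain Python. This is an illustration under assumptions, not the lncrnapy implementation: `check_columns_sketch` and its `present` argument (standing in for the columns held by the Data object) are hypothetical, and KeyError is only a guess at a reasonable exception type.

```python
def check_columns_sketch(present, columns, behaviour="error"):
    """Hypothetical stand-in for Data.check_columns.

    `present` plays the role of the column names held by the Data object.
    """
    missing = [column for column in columns if column not in present]
    if behaviour == "bool":
        # 'bool' mode: report the outcome instead of raising.
        return not missing
    if missing:
        # 'error' mode: KeyError is an assumption, the real type may differ.
        raise KeyError(f"Missing columns: {missing}")
    return True


present = ["id", "sequence", "label"]
print(check_columns_sketch(present, ["id", "label"], behaviour="bool"))
print(check_columns_sketch(present, ["length"], behaviour="bool"))
```

The ‘bool’ mode is convenient for guarding optional steps, while the default mode fails early when a required feature has not been calculated yet.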
- coding_noncoding_split()
Returns two Data objects, the first containing all pcRNA and the second containing all ncRNA.
- feature_correlation(feature_names)
Calculates the correlation between features in feature_names.
- filter_outliers(feature_name, tolerance)
Removes data entries for which the value of feature_name falls outside of the tolerated range.
- Parameters:
feature_name (str) – Name of the feature that should be considered.
tolerance (float | int | list | range) – When numeric, refers to the tolerated number of standard deviations from the mean. When a list or range, refers to the exact tolerated lower (inclusive) and upper (exclusive) bound.
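The two interpretations of tolerance can be illustrated with a small standard-library sketch. The helper name `tolerated_mask` is hypothetical, and two details are assumptions rather than documented behaviour: the sample standard deviation is used, and the numeric bounds are treated as inclusive.

```python
from statistics import mean, stdev


def tolerated_mask(values, tolerance):
    """Return one boolean per value: True if it lies in the tolerated range."""
    if isinstance(tolerance, range):
        # range(a, b): lower bound a (inclusive), upper bound b (exclusive).
        lower, upper = tolerance.start, tolerance.stop
    elif isinstance(tolerance, (list, tuple)):
        # [a, b]: same convention as the range case.
        lower, upper = tolerance[0], tolerance[1]
    else:
        # Numeric: tolerated number of standard deviations from the mean.
        # Sample standard deviation and inclusive bounds are assumptions.
        mu, sigma = mean(values), stdev(values)
        spread = tolerance * sigma
        return [abs(value - mu) <= spread for value in values]
    return [lower <= value < upper for value in values]


lengths = [200, 250, 300, 350, 400, 5000]
print(tolerated_mask(lengths, 2))           # drops only the 5000 entry
print(tolerated_mask(lengths, [250, 400]))  # keeps 250, 300 and 350
```

A numeric tolerance adapts to the distribution of the feature, whereas a list or range fixes the bounds independently of the data.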
- filter_sequence_quality(tolerance)
Removes data entries for which the percentage of uncertain bases (non-ACGT) exceeds a tolerance fraction.
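As a rough illustration of the criterion, the fraction of uncertain bases can be computed as below. The function names are hypothetical, and treating lowercase bases as equivalent to uppercase is an assumption; the library may handle case differently.

```python
def uncertain_fraction(sequence):
    """Fraction of bases in `sequence` that are not A, C, G or T."""
    if not sequence:
        return 0.0
    # Upper-casing first is an assumption about how soft-masked bases count.
    uncertain = sum(1 for base in sequence.upper() if base not in "ACGT")
    return uncertain / len(sequence)


def passes_quality(sequence, tolerance):
    """Keep a sequence unless its uncertain fraction exceeds `tolerance`."""
    return uncertain_fraction(sequence) <= tolerance


sequences = ["ACGTACGT", "ACGTNNAC", "NNNNACGT"]
kept = [seq for seq in sequences if passes_quality(seq, 0.25)]
print(kept)  # the third sequence (50% uncertain) is dropped
```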
- num_coding_noncoding()
Returns a tuple containing the number of coding and non-coding sequences in the Data object, respectively.
- plot_feature_boxplot(feature_name, filepath=None, figsize=None, **kwargs)
Returns a boxplot of the feature specified by feature_name, saving the plot to filepath if provided.
- plot_feature_correlation(feature_names, filepath=None, figsize=None)
Plots heatmap of absolute correlation values.
- plot_feature_density(feature_name, filepath=None, lower=0.025, upper=0.975, figsize=None, **kwargs)
Returns a density plot of the feature specified by feature_name, saving the plot to filepath if provided.
- Parameters:
feature_name (str) – Name of the to-be-plotted feature.
filepath (str) – If specified, will save plot to this path.
lower (float) – Lower limit of density plot, indicated as quantile (default is 0.025).
upper (float) – Upper limit of density plot, indicated as quantile (default is 0.975).
figsize (tuple[int]) – Matplotlib figure size (default is None).
kwargs – Any keyword argument from pd.DataFrame.plot.density.
- plot_feature_scatter(x_feature_name, y_feature_name, c_feature_name=None, c_lower=0.025, c_upper=0.975, filepath=None, axis_labels=True, xlim=None, ylim=None, figsize=None)
Returns a scatter plot with x_feature_name on the x-axis plotted against y_feature_name on the y-axis.
- plot_feature_space(feature_names, dim_red=TSNE(), filepath=None, figsize=None)
Returns a visualization of the feature space of the data, reducing the dimensionality to 2.
- plot_feature_violin(feature_name, filepath=None, figsize=None, **kwargs)
Returns a violin plot of the feature specified by feature_name, saving the plot to filepath if provided.
- pos_weight()
Ratio of non-coding/coding samples, used as weight for positive class in weighted loss calculation.
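The ratio itself is simple to reproduce. The sketch below assumes the label encoding described for set_tensor_features (1 for pcRNA, 0 for lncRNA); `pos_weight_sketch` is a hypothetical helper operating on a plain list of labels, not the library method.

```python
def pos_weight_sketch(labels):
    """Ratio of non-coding (label 0) to coding (label 1) samples."""
    n_coding = sum(1 for label in labels if label == 1)
    n_noncoding = sum(1 for label in labels if label == 0)
    return n_noncoding / n_coding


labels = [1, 0, 0, 0, 1, 0]
print(pos_weight_sketch(labels))  # 4 non-coding / 2 coding
```

A value above 1 means the positive (coding) class is under-represented, so its errors are weighted more heavily. This matches the convention of the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, which is presumably where the value is meant to be passed.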
- sample(pc=None, nc=None, N=None, replace=False, random_state=None)
Returns a randomly sampled Data object.
- Parameters:
pc (int | float | None) – Number/fraction of protein-coding sequences in resulting dataset.
nc (int | float | None) – Number/fraction of non-coding sequences in resulting dataset.
N (int | None) – Total number of sequences in resulting dataset. When specified together with pc and nc, will consider the latter two as fractions.
replace (bool) – Whether or not to sample with replacement. Required if N or pc+nc exceeds the number of samples in the Data object.
random_state (int) – Seed for random number generator.
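Why replace becomes mandatory when more sequences are requested than are available can be seen in a minimal standard-library sketch. The `draw` helper is hypothetical and covers only the replacement and seeding logic; how pc, nc and N are resolved into counts is not shown because the source does not specify it in enough detail.

```python
import random


def draw(items, n, replace=False, random_state=None):
    """Draw `n` items, with or without replacement, from a seeded generator."""
    rng = random.Random(random_state)
    if replace:
        # With replacement, n may exceed len(items).
        return rng.choices(items, k=n)
    if n > len(items):
        raise ValueError("n exceeds the number of items; set replace=True")
    return rng.sample(items, n)


ids = ["tx1", "tx2", "tx3", "tx4", "tx5"]
subset = draw(ids, 3, random_state=0)
oversampled = draw(ids, 8, replace=True, random_state=0)
print(subset, oversampled)
```

Passing the same random_state reproduces the same draw, which is useful when comparing models on identical subsets.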
- set_random_reading_frame(rrf)
Samples data in a random reading frame by deleting a random number (within the range [0, `rrf`]) of nucleotides from the start of a sequence. Only works with 4D-DNA encoding.
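The effect of the random offset can be shown on a plain string. This is a sketch only: the documented method operates on 4D-DNA encoded data, whereas the hypothetical `random_reading_frame` helper below works on raw sequences and takes an explicit random generator.

```python
import random


def random_reading_frame(sequence, rrf, rng):
    """Drop between 0 and `rrf` nucleotides (inclusive) from the start."""
    offset = rng.randint(0, rrf)  # randint includes both endpoints
    return sequence[offset:]


rng = random.Random(42)
sequence = "ATGGCCATTGTAATGGGCCGC"
shifted = [random_reading_frame(sequence, 2, rng) for _ in range(5)]
for s in shifted:
    print(s)
```

With rrf=2 every one of the three possible reading frames can occur, so a model cannot rely on the sequence always starting in frame.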
- set_tensor_features(X_name, X_dtype=torch.float32, y_name=None, y_dtype=torch.float32, len_4d_dna=7670)
Configures Data object to return a tuple of tensors (X,y) whenever __getitem__ is called.
- Parameters:
X_name (list[str] | str) – Predictor feature names (columns) to be retrieved as tensors when __getitem__ is called.
X_dtype (type) – Data type for X (default is torch.float32)
y_name (list[str] | str) – Target feature names (columns) to be retrieved as tensors when __getitem__ is called. If None (default) and the data is labelled, will set ‘label’ as target feature, with 1 indicating pcRNA and 0 indicating lncRNA.
y_dtype (type) – Data type for y (default is torch.float32)
len_4d_dna (int) – Maximum length of the returned sequence when X_name is ‘4D-DNA’ (default is 7670)
- test_features(feature_names)
Evaluates statistical significance of features specified in feature_names using a t-test.
- to_csv(path_or_buf, except_columns=['sequence'], **kwargs)
Write data to .csv file.
- Parameters:
path_or_buf – Target file or buffer.
except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])
kwargs – Any keyword argument accepted by pd.DataFrame.to_csv
- to_fasta(fasta_filepath)
Writes sequence data to the FASTA file(s) specified by fasta_filepath, which can be a string or a list of two strings indicating the filepaths for coding and non-coding transcripts, respectively.
- to_hdf(path_or_buf, except_columns=['sequence'], **kwargs)
Write data to .h5 file.
- Parameters:
path_or_buf – Target file or buffer.
except_columns (list[str]) – Column names specified here won’t be exported (default is [‘sequence’])
kwargs – Any keyword argument accepted by pd.DataFrame.to_hdf
- train_test_split(test_size, **kwargs)
Splits the data into train and test datasets, as specified by test_size. Accepts all keyword arguments from sklearn.model_selection.train_test_split.
- lncrnapy.data.get_gencode_gene_names(data)
Returns list of GENCODE gene names extracted from “id” column.
- lncrnapy.data.get_rna_type_refseq(fasta_header)
Extracts the RNA type from an input FASTA header line.
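A possible way to perform this extraction is sketched below. It assumes that the RNA type is the last comma-separated field of a RefSeq header, which holds for typical RefSeq transcript headers but is not confirmed by the source as the rule the library uses. The helper name and the example header are made up for illustration.

```python
def rna_type_from_header(fasta_header):
    """Return the text after the last comma of a RefSeq-style header."""
    # Assumption: the RNA type is the final comma-separated field.
    return fasta_header.rsplit(",", 1)[-1].strip()


# Made-up header in RefSeq style, not a real accession.
header = ">NM_000000.1 Homo sapiens example gene (EXMPL), transcript variant 1, mRNA"
print(rna_type_from_header(header))
```

The returned string can then be compared against lists such as the pc_types and nc_types arguments of split_refseq.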
- lncrnapy.data.merge_fasta(in_filepaths, out_filepath)
Merges FASTA files in in_filepaths into a single out_filepath.
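Merging amounts to concatenating the input files in order. The sketch below is a minimal standard-library version, not the library code: `merge_fasta_sketch` is a hypothetical name, and adding a newline after inputs that lack a trailing one is an assumption made to keep records from running together.

```python
import tempfile
from pathlib import Path


def merge_fasta_sketch(in_filepaths, out_filepath):
    """Concatenate FASTA files in the given order into one output file."""
    with open(out_filepath, "w") as out:
        for path in in_filepaths:
            text = Path(path).read_text()
            out.write(text)
            # Guard against a header being glued to the previous sequence.
            if text and not text.endswith("\n"):
                out.write("\n")


with tempfile.TemporaryDirectory() as tmp:
    pc_path = Path(tmp) / "pc.fasta"
    nc_path = Path(tmp) / "nc.fasta"
    merged_path = Path(tmp) / "merged.fasta"
    pc_path.write_text(">pc1\nATGGCC")  # no trailing newline
    nc_path.write_text(">nc1\nGGCTTA\n")
    merge_fasta_sketch([pc_path, nc_path], merged_path)
    merged = merged_path.read_text()

print(merged)
```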
- lncrnapy.data.plot_cross_dataset_violins(data_objects, data_names, feature_name, filepath=None, upper=0.975, lower=0.025, figsize=None, **kwargs)
Creates violin plots for multiple data_objects for a given feature_name. This allows datasets to be compared.
- lncrnapy.data.plot_refseq_labels(fasta_filepath, filepath=None, figsize=None)
Plots the distribution of RNA labels of a FASTA file that follows the RefSeq format, optionally saving the figure to filepath.
- lncrnapy.data.reduce_dimensionality(data, dim_red=TSNE())
Reduces the dimensionality of data using dim_red.
- lncrnapy.data.split_refseq(in_filepath, pc_filepath, nc_filepath, pc_types=['mRNA'], nc_types=['long non-coding RNA'])
Splits RefSeq FASTA file into two files, coding and noncoding.