lncrnapy.features

Contains feature extractor classes that can calculate several features of RNA sequences, such as Most-Like Coding Sequences and nucleotide frequencies.

Every feature extractor class contains:

* A name attribute of type str, indicating what name a Data column for this feature will have.
* A calculate method with a Data object as argument, returning a list or array of the same length as the Data object.
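The contract described above can be sketched with a toy extractor. This is an illustration only, not lncrnapy code: `ToyATContent` is a hypothetical class, and a plain list of strings stands in for the Data object, whose real interface is not described in this section.

```python
class ToyATContent:
    """Hypothetical extractor that follows the name/calculate contract."""

    def __init__(self):
        # Name of the column this feature would occupy.
        self.name = "AT content"

    def calculate(self, sequences):
        # Assumption: `sequences` is a plain list of strings standing in
        # for a Data object. One value is returned per input sequence.
        return [
            (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0
            for seq in sequences
        ]


extractor = ToyATContent()
values = extractor.calculate(["ACGT", "AATT", "GGCC"])
```

The returned list has the same length as the input, which is the only shape requirement stated for calculate.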

BLAST

Feature extractors that perform a BLAST database search.

class lncrnapy.features.blast.BLASTNIdXCov

Alignment identity x coverage

calculate(data): Calculates the BLASTN id x cov feature for every row in data

class lncrnapy.features.blast.BLASTNSearch(reference, remote=False, strand='plus', threads=None, output_dir='', save_results=False)

Performs a BLASTN database search and returns the identity, alignment length and bit score of the hit with the highest bit score for each query.

`reference`

Reference to use for BLASTN feature extraction, usually the path to a local BLAST database. If the provided string ends with ‘.csv’, it is assumed to refer to a saved BLASTN output from a finished run, and all other arguments are ignored. Alternatively, when running remotely (remote=True), this argument should correspond to the name of an official BLAST database.

Type:: str

`remote`

Whether to run remotely or locally. If False, requires a local installation of BLAST with a callable blastn program (default is False).

Type:: bool

`strand`

Which reading direction(s) to consider (default is ‘plus’).

Type:: [‘both’|‘plus’|‘minus’]

`threads`

Specifies how many threads for BLAST to use (when running locally).

Type:: int

`tmp_folder`

Path to folder where temporary FASTA and output files will be saved.

Type:: str

`name`

Column names for extracted features.

Type:: list[str]

calculate(data): Calculates BLASTX database search features for all rows in data.

read_blastn_output(out_filepath): Reads a BLAST .csv output file (outfmt 10) and assigns descriptive column names.

run_blastn(data): Runs a BLASTN database search for all rows in data.

class lncrnapy.features.blast.BLASTXBinary(threshold=0)

Calculates whether or not the number of BLASTX hits surpasses a preset threshold.

`threshold`

Minimum number of BLASTX hits to be surpassed to return True (default is 0).

Type:: int

`name`

Name of BLASTXBinary feature (‘BLASTX hits > {threshold}’)

Type:: str

calculate(data): Calculates the BLASTX binary feature for every row in data

class lncrnapy.features.blast.BLASTXSearch(reference, remote=False, evalue=1e-10, strand='plus', threads=None, output_dir='', save_results=False)

Derives features based on a BLAST database search:

* BLASTX hits: number of blastx hits found for given transcript.
* BLASTX hit score: mean of the mean log evalue over each of the three reading frames (as proposed by CPC).
* BLASTX frame score: mean of the deviation from the hit score over each of the three reading frames (as proposed by CPC).
* BLASTX S-score: sum of the logs of the significant scores (as proposed by PLncPro).
* BLASTX bit score: total bit score (as proposed by PLncPro).
* BLASTX frame entropy: Shannon entropy of probabilities that hits are in the ith frame (as proposed by PLncPro).
* BLASTX identity: sum of the identity percentage for each of the blast hits for a given sequence.

`reference`

Reference to use for BLASTX feature extraction, usually the path to a local BLAST database. If the provided string ends with ‘.csv’, it is assumed to refer to a saved BLASTX output from a finished run, and all other arguments are ignored. Alternatively, when running remotely (remote=True), this argument should correspond to the name of an official BLAST database.

Type:: str

`remote`

Whether to run remotely or locally. If False, requires a local installation of BLAST with a callable blastx program (default is False).

Type:: bool

`evalue`

Cut-off value for statistical significance (default is 1e-10).

Type:: float

`strand`

Which reading direction(s) to consider (default is ‘plus’).

Type:: [‘both’|‘plus’|‘minus’]

`threads`

Specifies how many threads for BLAST to use (when running locally).

Type:: int

`tmp_folder`

Path to folder where temporary FASTA and output files will be saved.

Type:: str

`name`

Column names for the extracted BLASTX features.

Type:: list[str]

References

CPC: Kong et al. (2007) https://doi.org/10.1093/nar/gkm391
PLncPro: Singh et al. (2017) https://doi.org/10.1093/nar/gkx866

calculate(data): Calculates BLASTX database search features for all rows in data.

calculate_per_sequence(blast_result): Calculate BLASTX features for given query result

read_blastx_output(out_filepath): Reads a BLAST .csv output file (outfmt 10) and assigns descriptive column names.

run_blastx(data): Runs a BLASTX database search for all rows in data.

EIIP

Features based on the Electron-Ion Interaction Pseudopotential (EIIP), as proposed by Han et al. (2018).

class lncrnapy.features.eiip.EIIPPhysicoChemical(eiip_map={'A': 0.126, 'C': 0.134, 'G': 0.0806, 'T': 0.1335})

EIIP-derived physico-chemical features, as proposed by LncFinder. Every sequence is converted into an EIIP representation, of which the power spectrum is calculated with a Fast Fourier Transform. Several properties are derived from this power spectrum.

`name`

Names of the EIIP-derived physico-chemical features.

Type:: str

`eiip_map`

Mapping to convert nucleotides into EIIP values.

Type:: dict[str:float]

References

LncFinder: Han et al. (2018) https://doi.org/10.1093/bib/bby065

calculate(data): Calculate EIIP physico-chemical features for every row in data.

calculate_per_sequence(sequence): Calculate EIIP physico-chemical features of given sequence.

calculate_power_spectrum(sequence): Given an RNA sequence, convert it to EIIP values and calculate its power spectrum.
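The conversion to EIIP values followed by a power spectrum can be sketched as below. This is not the library's implementation: `eiip_power_spectrum` is a hypothetical function, it uses a naive discrete Fourier transform from the standard library in place of an FFT (the result is the same, only slower), and mapping unknown bases to 0.0 is an assumption. The EIIP values are the defaults shown in the class signature.

```python
import cmath

EIIP_MAP = {"A": 0.126, "C": 0.134, "G": 0.0806, "T": 0.1335}


def eiip_power_spectrum(sequence, eiip_map=EIIP_MAP):
    """Return |X[k]|^2 for the EIIP-encoded sequence."""
    # Assumption: bases missing from the map contribute 0.0.
    signal = [eiip_map.get(base, 0.0) for base in sequence.upper()]
    n = len(signal)
    spectrum = []
    for k in range(n):
        # Naive DFT coefficient X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n).
        x_k = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(signal)
        )
        spectrum.append(abs(x_k) ** 2)
    return spectrum


spectrum = eiip_power_spectrum("AAAA")
```

For a constant signal such as "AAAA", all power sits in the zero-frequency component, which makes the example easy to check by hand.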

Fickett

Fickett TESTCODE statistic (Fickett et al., 1982).

class lncrnapy.features.fickett.FickettScore(data, export_path=None)

Calculates the Fickett TESTCODE statistic as used by CPAT. Summarizes bias in nucleotide position values and frequencies.

`name`

Column name for Fickett score (‘Fickett Score’)

Type:: str

`pos_intervals`

Thresholds determining which index to use for the position LUT.

Type:: np.ndarray

`cont_intervals`

Thresholds determining which index to use for the content LUT.

Type:: np.ndarray

`pos_lut`

Position value look-up table. Percentage of coding fragments in the interval (as defined by pos_intervals) divided by the total number of fragments in the interval.

Type:: np.ndarray

`cont_lut`

Nucleotide content look-up table. Percentage of coding fragments in the interval (as defined by cont_intervals) divided by the total number of fragments in the interval.

Type:: np.ndarray

`fickett_weights`

Parameters used to weigh the four position values and four nucleotide content values. Equals the percentage of time that each value alone can successfully distinguish between coding/noncoding sequences.

Type:: np.ndarray

References

CPAT: Wang et al. (2013) https://doi.org/10.1093/nar/gkt006
Fickett et al. (1982) https://doi.org/10.1093/nar/10.17.5303

calculate(data): Calculates Fickett score for every row in data.

calculate_per_sequence(sequence): Calculates Fickett score of sequence.

loadtxt(data): Extracts look-up tables and weights from txt file.

nucleotide_frequencies(sequence): Calculates nucleotide frequencies of sequence for every codon, returning a list of four values (corresponding to A, C, G, T, respectively).

position_values(sequence): Calculates position values of sequence for every codon, returning a list of four values (corresponding to A, C, G, T, respectively). Describes bias of nucleotides occurrence at specific codon positions.

savetxt(filepath): Saves look-up tables and weights to txt file.
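The two raw quantities behind the Fickett score, position values and nucleotide frequencies, can be sketched as follows. The position value uses the max/(min + 1) ratio of per-codon-position counts from the original TESTCODE description; the function bodies here are a standalone illustration and may differ from what FickettScore does internally (for example in how it handles empty or lowercase input, which is assumed here).

```python
def position_values(sequence):
    """Position value for A, C, G, T: max/(min + 1) of per-phase counts."""
    values = []
    for base in "ACGT":
        # Count the base separately at codon positions 0, 1 and 2.
        phase_counts = [sequence[phase::3].count(base) for phase in range(3)]
        values.append(max(phase_counts) / (min(phase_counts) + 1))
    return values


def nucleotide_frequencies(sequence):
    """Overall frequency of A, C, G, T in the sequence."""
    if not sequence:
        return [0.0] * 4
    return [sequence.count(base) / len(sequence) for base in "ACGT"]


positions = position_values("ATGATGATG")
contents = nucleotide_frequencies("ATGATGATG")
```

In the full statistic, each of these eight values is mapped through the look-up tables and combined using the weights described above.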

General

Feature extractors for general features.

class lncrnapy.features.general.Complexity

Calculates the (local) compositional complexity (entropy) of a transcript sequence.

calculate(data): Calculates local compositional complexity of all rows in data.

calculate_per_sequence(sequence): Calculates the complexity for a given sequence.

class lncrnapy.features.general.Entropy(new_feature_name, feature_names)

Calculates the Shannon entropy of specific features of a sequence.

`feature_names`

Names of the features for which the entropy should be calculated.

Type:: list[str]

`name`

Name of the combined entropy feature calculated by this class.

Type:: str

calculate(data): Calculates the entropy of features for every row in Data.

class lncrnapy.features.general.EntropyDensityProfile(feature_names)

Calculates the Entropy Density Profile (EDP) as utilized by LncADeep.

`feature_names`

Names of the features for which the entropy should be calculated.

Type:: list[str]

`name`

Names of entropy density columns calculated by this class.

Type:: str

calculate(data): Calculates the EDP for every row in data.

calculate_edp(values): Calculates the EDP for given list of values

class lncrnapy.features.general.GCContent(apply_to='sequence')

Calculates the proportion of bases that are either Guanine or Cytosine.

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`name`

Name of the feature calculated by this object (‘GC content’)

Type:: str

calculate(data): Calculates GC content for every row in data.

calculate_per_sequence(sequence): Calculates the GC content for a given sequence.

class lncrnapy.features.general.Length

Calculates lengths of sequences.

`name`

Column name for sequence length (‘length’).

Type:: str

class lncrnapy.features.general.Quality

Calculates the ratio of uncertain bases (bases other than ACGT) per sequence.

`name`

Name of the feature calculated by this class (‘quality’).

Type:: str

calculate(data): Calculates the quality for all sequences in data.

class lncrnapy.features.general.SequenceDistribution(apply_to='sequence', vocabulary=['A', 'C', 'G', 'T'])

For every word in a given vocabulary, calculate the percentage that is contained within every quarter of the total length of the sequence. Loosely based on the D (distribution) feature of CTD as proposed by CPPred.

`vocabulary`

Words of equal length k for which to determine the distribution.

Type:: dict[str:int]

`k`

Inferred length of the words in the vocabulary.

Type:: int

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`name`

List of column names of feature calculated by this class.

Type:: list[str]

References

CPPred: Tong et al. (2019) https://doi.org/10.1093/nar/gkz087

calculate(data): Calculates the sequence distribution for every row in data.

calculate_per_sequence(sequence): Calculates the distriution of a given sequence.

class lncrnapy.features.general.StdStopCodons(apply_to='sequence')

Calculates the standard deviations of stop codon counts between three reading frames, as formulated by lncRScan-SVM.

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`name`

Column name of feature calculated by this class (‘SCS’)

Type:: str

References

lncRScan-SVM: Sun et al. (2015) https://doi.org/10.1371/journal.pone.0139654

calculate(data): Calculates the std of stop codon counts for every row in data.

calculate_per_sequence(sequence): Calculates the std of stop codon counts for a given sequence.

K-mer

Feature extractors based on k-mer frequencies.

class lncrnapy.features.kmer.KmerBase(k, stride=1, alphabet='ACGT', gap_length=0, gap_pos=0, uncertain='')

Base class for k-mer-based feature extractors.

`k`

Length of to-be-generated nucleotide combinations in the vocabulary.

Type:: int

`stride`

Step size of sliding window during calculation.

Type:: int

`alphabet`

Alphabet of characters that the k-mers consist of (default is ‘ACGT’).

Type:: str

`gap_length`

Introduces a gap of the specified length into k-mer sequences; the position of this gap is controlled by gap_pos.

Type:: int

`gap_pos`: Introduces a gap at the specified position in k-mer sequences; the length of this gap is controlled by gap_length.

`uncertain`

Optional character that indicates any base that falls outside of ACGT.

Type:: str

`k-mers`

Dictionary containing k-mers (keys) and corresponding indices (values).

Type:: dict[str:int]

calculate_kmer_freqs(sequence): Calculates k-mer frequency spectrum for given sequence.

replace_uncertain_bases(sequence): Replaces non-ACGT bases in sequence with self.uncertain.
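A k-mer frequency spectrum with a configurable stride can be sketched as follows. This is a simplified standalone version, not the KmerBase code: `kmer_frequencies` is a hypothetical function, gaps are not modelled, and k-mers containing characters outside the alphabet are simply skipped (an assumption).

```python
from itertools import product


def kmer_frequencies(sequence, k, stride=1, alphabet="ACGT"):
    """Relative frequency of every possible k-mer in the sequence."""
    # Vocabulary: k-mer -> index, in lexicographic order of the alphabet.
    kmers = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(kmers)
    for start in range(0, len(sequence) - k + 1, stride):
        index = kmers.get(sequence[start:start + k])
        if index is not None:  # skip k-mers with unknown characters
            counts[index] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


mono = kmer_frequencies("ACGT", k=1)
codons = kmer_frequencies("ACGACG", k=3, stride=3)
```

With stride=3 and k=3 the window moves codon by codon, so "ACGACG" yields the single k-mer ACG twice.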

class lncrnapy.features.kmer.KmerDistance(data, k, dist_type, apply_to='sequence', stride=1, alphabet='ACGT', gap_length=0, gap_pos=0, export_path=None)

Calculates distance to average k-mer profiles of coding and non-coding RNA transcripts, as introduced by LncFinder. Also calculates the ratio of the two distances.

`k`

Length of to-be-generated nucleotide combinations in the vocabulary.

Type:: int

`pc_kmer_profile`

Average k-mer frequency spectrum of protein-coding transcripts.

Type:: np.ndarray

`nc_kmer_profile`

Average k-mer frequency spectrum of non-coding transcripts.

Type:: np.ndarray

`dist_type`

Whether to use Euclidean or logarithmic distance.

Type:: ‘euc’|‘log’

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`stride`

Step size of sliding window during calculation.

Type:: int

`alphabet`

Alphabet of characters that the k-mers consist of.

Type:: str

`gap_length`

Introduces a gap of the specified length into k-mer sequences; the position of this gap is controlled by gap_pos.

Type:: int

`gap_pos`: Introduces a gap at the specified position in k-mer sequences; the length of this gap is controlled by gap_length.

`uncertain`

Optional character that indicates any base that falls outside of ACGT.

Type:: str

`k-mers`

Dictionary containing k-mers (keys) and corresponding indices (values).

Type:: dict[str:int]

`name`

Column names for k-mer distance (to protein-/non-coding) (ratio) features.

Type:: list[str]

References

LncFinder: Han et al. (2018) https://doi.org/10.1093/bib/bby065

calculate(data): Calculates k-mer distance for every row in data.

class lncrnapy.features.kmer.KmerFreqs(k, apply_to='sequence', stride=1, alphabet='ACGT', PLEK=False, gap_length=0, gap_pos=0)

For every k-mer, calculate its occurrence frequency in the sequence divided by the total number of k-mers appearing in that sequence.

`k`

Length of to-be-generated nucleotide combinations in the vocabulary.

Type:: int

`scaling`

Scaling factor applied to every k-mer spectrum. Usually set to 1, unless PLEK argument was True at initialization.

Type:: float

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`stride`

Step size of sliding window during calculation.

Type:: int

`alphabet`

Alphabet of characters that the k-mers consist of (default is ‘ACGT’).

Type:: str

`gap_length`

Introduces a gap of the specified length into k-mer sequences; the position of this gap is controlled by gap_pos.

Type:: int

`gap_pos`: Introduces a gap at the specified position in k-mer sequences; the length of this gap is controlled by gap_length.

`uncertain`

Optional character that indicates any base that falls outside of ACGT.

Type:: str

`k-mers`

Dictionary containing k-mers (keys) and corresponding indices (values).

Type:: dict[str:int]

`name`

Column names for frequency features (= all k-mers).

Type:: list[str]

calculate(data): Calculates k-mer frequencies for every row in data.

class lncrnapy.features.kmer.KmerScore(data, k, apply_to='sequence', stride=1, alphabet='ACGT', export_path=None)

Calculates k-mer score, indicating how likely a sequence is to be protein-coding (the higher, the more likely). Sums the log-ratios of k-mer frequencies in protein-coding RNA over non-coding RNA. Introduced by CPAT as hexamer score or hexamer usage bias.

`k`

Length of to-be-generated nucleotide combinations in the vocabulary.

Type:: int

`kmer_freqs`

Log-ratios of k-mer frequencies in protein-coding RNA over non-coding RNA.

Type:: np.ndarray

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

`stride`

Step size of sliding window during calculation.

Type:: int

`alphabet`

Alphabet of characters that the k-mers consist of (default is ‘ACGT’).

Type:: str

`uncertain`

Optional character that indicates any base that falls outside of ACGT.

Type:: str

`k-mers`

Dictionary containing k-mers (keys) and corresponding indices (values).

Type:: dict[str:int]

`name`

Column names for frequency features (= all k-mers).

Type:: list[str]

References

CPAT: Wang et al. (2013) https://doi.org/10.1093/nar/gkt006
FEELnc: Wucher et al. (2017) https://doi.org/10.1093/nar/gkw1306

calculate(data): Calculates k-mer score for every row in data.

calculate_per_sequence(sequence): Calculates k-mer score of sequence.

count_kmers(sequence): Returns an array of k-mer counts in sequence.

plot_bias(filepath=None, figsize=None): Plots log ratio of usage frequency of k-mers in pcRNA/ncRNA.
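The idea of summing log-ratios of k-mer frequencies can be sketched in a few functions. This is a standalone illustration rather than the KmerScore implementation: the function names, the dictionary-based table, and the pseudocount used to avoid taking the log of zero are all assumptions.

```python
import math
from collections import Counter


def count_kmers(sequences, k):
    """Count overlapping k-mers over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def log_ratio_table(pc_sequences, nc_sequences, k, pseudocount=1.0):
    """Log-ratio of k-mer frequency in coding over non-coding sequences."""
    pc, nc = count_kmers(pc_sequences, k), count_kmers(nc_sequences, k)
    kmers = set(pc) | set(nc)
    pc_total = sum(pc.values()) + pseudocount * len(kmers)
    nc_total = sum(nc.values()) + pseudocount * len(kmers)
    return {
        kmer: math.log(
            ((pc[kmer] + pseudocount) / pc_total)
            / ((nc[kmer] + pseudocount) / nc_total)
        )
        for kmer in kmers
    }


def kmer_score(sequence, table, k):
    """Sum of log-ratios over all k-mers in the sequence (unseen k-mers add 0)."""
    return sum(
        table.get(sequence[i:i + k], 0.0) for i in range(len(sequence) - k + 1)
    )


table = log_ratio_table(["ATGATG"], ["CCCCCC"], k=3)
coding_like = kmer_score("ATGATG", table, k=3)
noncoding_like = kmer_score("CCCCCC", table, k=3)
```

A sequence made of k-mers that are enriched in the coding set receives a positive score, and one made of k-mers enriched in the non-coding set receives a negative score.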

MLCDS

Feature extractors based on the Most-Like Coding Sequence (MLCDS)

class lncrnapy.features.mlcds.MLCDS(data, export_path=None)

Determines Most-Like Coding Sequence (MLCDS) coordinates based on Adjoined Nucleotide Triplets (ANT), as proposed by CNCI. Calculates six MLCDSs, based on two directions and three reading frames. The MLCDSs are sorted based on score, where MLCDS1 has the highest score.

`name`

Column names for MLCDS coordinates and scores.

Type:: list[str]

`kmers`

Dictionary containing 3-mers (keys) and corresponding indices (values).

Type:: dict[str:int]

`ant_matrix`

Adjoined Nucleotide Triplet matrix, containing the log-ratios of ANTs appearing in coding over non-coding RNA.

Type:: np.ndarray

References

CNCI: Sun et al. (2013) https://doi.org/10.1093/nar/gkt646
CNIT: Guo et al. (2019) https://doi.org/10.1093/nar/gkz400

calculate(data): Calculates MLCDS for every row in data.

calculate_ant_matrix(data): Calculates Adjoined Nucleotide Triplet matrix, containing the log-ratios of ANTs appearing in coding over non-coding RNA.

calculate_per_sequence(sequence): Calculates MLCDS of sequence.

get_abs_coordinates(coord1, coord2, dir, offset): Transforms a set of coordinates relative to a reading frame to a set of coordinates that are defined in relation to the original sequence.

get_mlcds(reading_frame)

Calculates Most-Like Coding Sequence (MLCDS) given reading frame, using the Adjoined Nucleotide Triplet (ANT) matrix.

Adapted from cal_score function in CNIT’s code by Fang Shuangsang.

get_reading_frame(sequence, dir, offset): Extract reading frame from sequence given direction and offset

imshow_ant_matrix(filepath=None, figsize=None, **kwargs)

Plots ANT matrix.

Parameters:

filepath (str) – If provided, will export figure to this filepath.
kwargs – Any kwargs accepted by matplotlib.pyplot.imshow.

class lncrnapy.features.mlcds.MLCDSLength

Lengths of Most-Like Coding Sequences.

`name`

Column name for MLCDS lengths

Type:: list[str]

calculate(data): Calculates MLCDS lengths for all rows in data.

class lncrnapy.features.mlcds.MLCDSLengthPercentage

Length percentages of Most-Like Coding Sequences. Defined as the length of the MLCDS with the highest score, divided by the sum of the lengths of the remaining MLCDSs.

`name`

Column name for MLCDS length percentage (‘MLCDS length-percentage’)

Type:: list[str]

calculate(data): Calculates length percentage for every row in data.

class lncrnapy.features.mlcds.MLCDSLengthStd

Standard deviation of Most-Like Coding Sequences lengths.

`name`

Column name for MLCDS length standard deviation (‘MLCDS length (std)’)

Type:: str

calculate(data): Calculates MLCDS length standard deivation for all rows in data.

class lncrnapy.features.mlcds.MLCDSScoreDistance

Score distance of Most-Like Coding Sequences. Defined as the sum of the differences between the score of the highest-scoring MLCDS and that of the remaining ones.

`name`

Column name for MLCDS score distance (‘MLCDS score-distance’)

Type:: str

calculate(data): Calculates score distance for every row in data.

class lncrnapy.features.mlcds.MLCDSScoreStd

Standard deviation of Most-Like Coding Sequences scores.

`name`

Column name for MLCDS score standard deviation (‘MLCDS score (std)’)

Type:: str

calculate(data): Calculates MLCDS score standard deivation for all rows in data.

MLM Accuracy

Evaluates a Masked Language Model and adds the per-sequence accuracy as feature to the data.

class lncrnapy.features.mlm_accuracy.MLMAccuracy(model, p_mlm=0.15, p_mask=0.8, p_random=0.1, mask_size=1)

Calculates per-sequence MLM accuracy.

calculate(data): Calculates MLM accuracy for every row in data.

ORF

Feature extractors that identify and rely on the Open Reading Frame (ORF) in a transcript.

class lncrnapy.features.orf.ORFAminoAcidFreqs

Calculates occurrence frequencies in the protein encoded by an ORF for all amino acids.

`amino_acids`

All 20 possible amino acid symbols.

Type:: str

`name`

List of column names for ORF amino acid frequencies.

Type:: list[str]

References

CONC: Blake et al. (2006) https://doi.org/10.1371/journal.pgen.0020029

calculate(data): Calculates ORF amino acid frequencies for all rows in data.

calculate_per_sequence(protein): Calculates ORF amino acid frequencies for a given protein.

class lncrnapy.features.orf.ORFCoordinates(min_length=75, relaxation=0)

Determines Open Reading Frame (ORF) coordinates, similar to NCBI’s ORFFinder (https://www.ncbi.nlm.nih.gov/orffinder/)

`name`

Column names for ORF coordinates (‘ORF (start)’, ‘ORF (end)’).

Type:: list[str]

`min_length`

Minimum required length for an ORF.

Type:: int

`relaxation`

Relaxation type of the ORF algorithm, as defined by FEELnc:

* 0: Both a start and a stop codon are required.
* 1: Start codon is required.
* 2: Stop codon is required.
* 3: Start or stop codon is required.
* 4: If no ORF is found, use the full-length transcript.

Type:: int

References

FEELnc: Wucher et al. (2017) https://doi.org/10.1093/nar/gkw1306

calculate(data): Calculates ORF for every row in data.

calculate_per_sequence(sequence): Returns start (incl.) and stop (excl.) position of longest ORF in sequence.

class lncrnapy.features.orf.ORFCoverage(relaxation=0)

Calculates ORF coverage (ORF length / sequence length).

`relaxation`

The relaxation level(s) of the ORFs for which this feature must be calculated (default is 0).

Type:: list|int

`name`

Column names for ORF coverage (given relaxation type).

Type:: list[str]

calculate(data): Calculates ORF coverage for every row in data.

class lncrnapy.features.orf.ORFIsoelectric: Theoretical isoelectric point of the protein encoded by the ORF.

class lncrnapy.features.orf.ORFLength(relaxation=0)

Calculates length of Open Reading Frame (ORF) based on coordinates.

`relaxation`

The relaxation level(s) of the ORFs for which this feature must be calculated (default is 0).

Type:: list|int

`name`

Column names for ORF length (given relaxation type).

Type:: list[str]

calculate(data): Calculates ORF length for every row in data.

class lncrnapy.features.orf.ORFProtein

Translates ORF of transcript into amino acid sequence.

`name`

Column name for ORF protein (‘ORF protein’).

Type:: str

calculate(data): Calculates ORF protein for every row in data.

calculate_per_sequence(sequence): Translates a given sequence into a amino-acid sequence.

class lncrnapy.features.orf.ORFProteinAnalysis(features={'MW': <function ProteinAnalysis.molecular_weight>, 'aromaticity': <function ProteinAnalysis.aromaticity>, 'gravy': <function ProteinAnalysis.gravy>, 'helix': <function ORFProteinAnalysis.<lambda>>, 'instability': <function ProteinAnalysis.instability_index>, 'pI': <function ProteinAnalysis.isoelectric_point>, 'sheet': <function ORFProteinAnalysis.<lambda>>, 'turn': <function ORFProteinAnalysis.<lambda>>})

Calculates features for the protein encoded by the ORF using methods from Bio.SeqUtils.ProtParam.ProteinAnalysis.

`features`

Dictionary with to-be-calculated features, with names (str) as keys, and corresponding methods of ProteinAnalysis as values.

Type:: dict

`name`

Column names for ORF features (inferred from features).

Type:: list[str]

calculate(data): Calculates ORF protein feature(s) for all rows in data.

calculate_per_sequence(sequence): Calculates ORF protein feature(s) for a given amino acid sequence.

class lncrnapy.features.orf.UTRCoverage

Calculates the coverage of the 5’ and 3’ Untranslated Regions (UTRs).

`name`

Column names for 5’ and 3’ UTR coverage.

Type:: list[str]

calculate(data): Calculates the UTR coverage for every row in data.

class lncrnapy.features.orf.UTRLength

Calculates the length of the 5’ and 3’ Untranslated Regions (UTRs).

`name`

Column names for 5’ and 3’ UTR length.

Type:: list[str]

calculate(data): Calculates the lengths of the 5’/3’ UTRs for every row in data.

no_orf_to_nan(array): Sets array value to nan if value at index is -1.

lncrnapy.features.orf.orf_column_names(columns, relaxation)

Generate column names for the ORF features in columns, for a specific relaxation type.

Parameters:

columns (list[str]) – List of ORF feature names (e.g. ‘length’)
relaxation (list|int) – Relaxation type(s) for the to-be-generated columns.

Sequence Base

Contains base class for features that operate on data (sub)sequences.

class lncrnapy.features.sequence_base.SequenceBase(apply_to='sequence')

Base class for features that operate on data (sub)sequences, the type of which is specified by the apply_to attribute.

Sequence-based features in lncrnapy are not required to inherit from this class, but it does make them more versatile as it enables a single implementation to operate on multiple sequence types.

Most important methods are get_sequence and check_columns, which will behave differently for different apply_to settings.

`apply_to`

Indicates which column this class extracts its features from.

Type:: str

check_columns(data): Checks if the required columns, based on the apply_to attribute, are present within data.

SSE

Feature extractors based on Secondary Structure Elements (SSEs).

class lncrnapy.features.sse.SSE

Calculates Secondary Structure Elements (SSEs) based on Minimum Free Energy (MFE), using the ViennaRNA package, as proposed by LncFinder.

`name`

Names of the features calculated by this object (‘MFE’, ‘SSE’).

Type:: list[str]

References

LncFinder: Han et al. (2018) https://doi.org/10.1093/bib/bby065

calculate(data): Calculates MFE and SSE for every row in data.

calculate_per_sequence(sequence): Calculates MFE and SSE for a given sequence.

class lncrnapy.features.sse.UPFrequency

Calculates the frequency of unpaired nucleotide bases (UP frequency), using the SSE, as proposed by LncFinder.

`name`

Name of the feature as calculated by this object (‘UP freq.’)

Type:: str

calculate(data): Calculates the UP frequency for every row in data.

calculate_per_sequence(row): Calculates the UP frequency of a given data row.
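If the secondary structure is available in dot-bracket notation (the format ViennaRNA produces), the UP frequency is simply the share of dots. The sketch below assumes that representation; `unpaired_frequency` is a hypothetical function that takes the structure string directly rather than a data row.

```python
def unpaired_frequency(dot_bracket):
    """Fraction of positions that are unpaired ('.') in a dot-bracket string."""
    if not dot_bracket:
        return 0.0
    return dot_bracket.count(".") / len(dot_bracket)


hairpin = unpaired_frequency("((..))")   # two unpaired bases in the loop
unfolded = unpaired_frequency("....")    # no base pairs at all
```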

lncrnapy.features.sse.get_hl_sse_sequence(data_row, type)

Returns high-level secondary structure-derived sequence, using the ‘sequence’ and ‘SSE’ columns of data_row.

Parameters:

data_row (pd.Series) – Row with ‘sequence’ and ‘SSE’ columns.
type ('acguD'|'acguS'|'acgu-ACGU') – Type of secondary structure-derived sequence.

References

LncFinder: Han et al. (2018) https://doi.org/10.1093/bib/bby065

Standardizer

Contains the Standardizer class, expanding the StandardScaler from scikit-learn to match and work with the lncrnapy API.

class lncrnapy.features.standardizer.Standardizer(data, apply_to)

Integrates the StandardScaler object from scikit-learn with the lncrnapy API. This allows it to operate on Data objects and be considered a feature extractor. Furthermore, the inverse_transform method can be called during evaluation of a deep neural network to express the model’s current error in the original units of the target.

calculate(data): Scales all rows in data.

inverse_transform(y): Scales back y into the original scaling.
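The scaling and its inverse can be illustrated without scikit-learn. The functions below are a standalone sketch of standardization (subtract the mean, divide by the population standard deviation) and are not the Standardizer class itself; falling back to a scale of 1.0 for constant input is a guard added here to avoid division by zero.

```python
import statistics


def fit(values):
    """Return the mean and population standard deviation of the values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant input
    return mean, std


def transform(values, mean, std):
    """Scale values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map scaled values back to the original units."""
    return [s * std + mean for s in scaled]


mean, std = fit([1.0, 2.0, 3.0, 4.0])
scaled = transform([1.0, 2.0, 3.0, 4.0], mean, std)
restored = inverse_transform(scaled, mean, std)
```

Applying inverse_transform to model predictions made in the scaled space gives errors in the original units, which is the use case mentioned above.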

Tokenizers

Tokenization methods required for deep learning language models.

class lncrnapy.features.tokenizers.BPELength(data, vocab_size=768, user_defined_symbols=None, max_sentence_length=8000, export_path=None)

Calculates the full Byte Pair Encoding length, without special tokens (e.g. CLS), assuming no context length cut-off.

calculate(data): Calculates theoretical BPE length for every row in data.

class lncrnapy.features.tokenizers.BPEPieces(data, vocab_size=768, user_defined_symbols=None, max_sentence_length=8000, export_path=None)

Calculates a piecewise representation of the sequence with Byte Pair Encoding, without converting the pieces to token indices.

calculate(data): Calculates the piecewise BPE sequence for all rows in data.

print(sequence, inspect_codon='ATG', line_chars=100): Prints a (piecewise BPE) sequence, highlighting the codon specified in inspect_codon.

class lncrnapy.features.tokenizers.BytePairEncoding(data, context_length=768, vocab_size=4096, user_defined_symbols=None, max_sentence_length=10000, export_path=None)

Byte Pair Encoding from SentencePiece, applied to nucleotide sequences.

References

BPE: Sennrich et al. (2016) https://doi.org/10.18653/v1/P16-1162
DNABERT-2: Zhou et al. (2023) https://doi.org/10.48550/arXiv.2306.15006

calculate(data): Calculates the BPE tokens for all rows in data.

get_piece_length_stats(): Returns the avg, std, min, and max length of word pieces in the BPE vocabulary.

property vocab_size: The number of unique tokens known by the model.

class lncrnapy.features.tokenizers.KmerTokenizer(k, context_length=768)

Tokenizer based on k-mers, every k-mer is given its own token.

`k`

Length of k-mers.

Type:: int

calculate(data): Calculates the token representations of all sequences in data.

calculate_kmer_tokens(sequence): Tokenizes sequence.
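A k-mer tokenizer that produces a fixed number of tokens per sample can be sketched as below. Several details are assumptions made for illustration and may not match KmerTokenizer: the special tokens and their indices, the use of non-overlapping k-mers, mapping unknown k-mers to an UNK token, and truncating or padding to the context length. `build_vocab` and `tokenize` are hypothetical names.

```python
from itertools import product


def build_vocab(k, specials=("PAD", "CLS", "UNK")):
    """Map special tokens and every possible k-mer to an integer index."""
    tokens = {name: i for i, name in enumerate(specials)}
    for kmer in product("ACGT", repeat=k):
        tokens["".join(kmer)] = len(tokens)
    return tokens


def tokenize(sequence, tokens, k, context_length):
    """Tokenize into non-overlapping k-mers, then truncate or pad."""
    ids = [tokens["CLS"]]
    for start in range(0, len(sequence) - k + 1, k):
        ids.append(tokens.get(sequence[start:start + k], tokens["UNK"]))
    ids = ids[:context_length]
    return ids + [tokens["PAD"]] * (context_length - len(ids))


vocab = build_vocab(k=3)
padded = tokenize("ATGAAANNN", vocab, k=3, context_length=6)
truncated = tokenize("ATGATGATG", vocab, k=3, context_length=2)
```

Every output has exactly context_length entries, so tokenized sequences of different lengths can be stacked into one array.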

class lncrnapy.features.tokenizers.TokenLocalization(tokenizer)

Adds special MASK token to tokenized input sequence, with the location of this MASK as new feature.

calculate(data): Adds MASK token and localization target to every row in data.

class lncrnapy.features.tokenizers.TokenizerBase(context_length, method_name)

Base class for tokenizers, only for some shared attributes.

`context_length`

Number of tokens this tokenizer generates per sample.

Type:: int

`vocab_size`

The number of unique tokens known by the model.

Type:: int

`tokens`

Mapping of sequences (or token indicators such as ‘CLS’) to the integer values that these tokens are represented by.

Type:: dict[str:int]

`name`

List of column names for the generated tokens.

Type:: list[str]

property vocab_size: The number of unique tokens known by the model.

lncrnapy.features.tokenizers.coverage_table(data, vocab_sizes, context_lengths, show_num_tokens=False): Creates a table that, for every combination of vocabulary size and context length, reports the proportion of deprecated sequences. When show_num_tokens is True, will report the average number of skipped tokens.

lncrnapy.features.tokenizers.plot_bpe_lengths(data, vocab_sizes, upper=0.975, lower=0.025, filepath=None): Combined density plot of BPE encoding length for all vocab_sizes.

Zhang

Zhang nucleotide bias around the start codon, as proposed by DeepCPP (Zhang et al., 2020).

class lncrnapy.features.zhang.ZhangScore(data, export_path=None)

Nucleotide bias around the start codon of the ORF, as proposed by DeepCPP.

`bias`

Array containing the nucleotide bias around the start codon.

Type:: np.ndarray

`name`

Column name of the feature calculated by this class (‘Zhang score’).

Type:: str

References

DeepCPP: Zhang et al. (2020) https://doi.org/10.1093/bib/bbaa039

calculate(data): Calculates the Zhang nucleotide bias score for every row in data.

calculate_per_sequence(sequence, orf_start): Calculates the Zhang nucleotide bias score for given sequence.